
Investigating the Reliability of Those Who Provide (and Those Who Interpret)

    Eyewitness Confidence Statements

    Jesse Howard Grabman

    Charlottesville, Virginia

    BA, University of Virginia, 2013

    A Predissertation Research Project presented to the

    Graduate Faculty of the University of Virginia

    in Candidacy for the Degree of Master of Arts

    Department of Psychology

    University of Virginia

    December, 2019

    Readers:

    Dr. Chad S. Dodson

    Dr. James P. Morris


    Introduction

On the morning of May 7, 2000, 15-year-old Brenton Butler was walking to retrieve a job application from the local Blockbuster Video. Two hours earlier, a ‘skinny black male’

    approached Mary and James Stephens outside their hotel and demanded Mary’s purse. Standing

    about three feet from the couple, the man pulled out a pistol and shot Mary dead before running

away. Two police officers saw Butler and pulled him aside, thinking he vaguely matched the perpetrator’s description. As Butler talked to a detective, James Stephens indicated from fifty feet away that this was the teenager who shot his wife. Taken aback, the officers brought

    Stephens closer, and he confirmed that “he was sure of it, he would not put an innocent man in

    jail” (De Lestrade, 2001). Butler was tried as an adult based on this eyewitness testimony, and

later acquitted after it emerged that investigators had coerced him into a false confession. Ultimately, forensic

    evidence proved a different man committed the crime.

    Judges in the United States are advised to use certainty as an indicator of eyewitness

reliability (Neil v. Biggers, 1972). Moreover, increasing evidence shows that high confidence at the

    time of the initial identification is a strong predictor of accuracy, so long as proper lineup

    administration procedures are followed (Wixted & Wells, 2017). This strong relationship

    between high confidence and accuracy is documented in many laboratory studies, using a variety

of manipulations (e.g., weapon vs. no weapon, other-race identifications) and stimuli (e.g.,

    identifications after viewing photos of faces, videos, and/or staged crimes). Moreover, a recent

    field study suggests that these findings extend to real-world identifications (Wixted, Mickes,

    Dunn, Clark, & Wells, 2016).

    However, as the Butler case demonstrates, high eyewitness confidence is not always

    reliable. In this thesis, I present research from our lab that raises important caveats to the


    growing consensus about a strong relationship between eyewitness confidence and accuracy.

    This includes lightly adapted versions of two published first-authored articles (Grabman,

    Dobolyi, Berelovich, & Dodson, 2019; Grabman & Dodson, 2019), as well as results from a

    recently submitted first-authored manuscript.

    Part I shows that individual differences in face recognition ability influence the rate of

    high confidence errors. Specifically, weaker face recognition ability corresponds to increased

    rates of high confidence errors in both a controlled eyewitness experiment using criminal lineups

    (Study 1A), and in an uncontrolled ‘real-world’ face recognition task of actors from the popular

    television show Game of Thrones (Study 1B). Part II shows that the probative value of

    eyewitness confidence statements depends on evaluators (e.g., police officers, judges, jurors)

    properly interpreting the level of certainty the witness intended to convey. In three experiments

    (Study 2A – C), participants systematically misinterpreted witnesses’ verbal confidence

    statements when they knew the identity of the suspect in a criminal lineup – a situation that is

    common in criminal justice decisions. Taken together, these studies suggest a degree of caution is

    warranted when using eyewitness confidence as an indicator of accuracy.

    Introduction References

    De Lestrade, J. X. (2001). Murder on a Sunday Morning.

Grabman, J. H., Dobolyi, D. G., Berelovich, N. L., & Dodson, C. S. (2019). Predicting high confidence errors in eyewitness memory: The role of face recognition ability, decision-time, and justifications. Journal of Applied Research in Memory and Cognition, 8(2), 233–243. https://doi.org/10.1016/j.jarmac.2019.02.002

    Grabman, J. H., & Dodson, C. S. (2019). Prior knowledge influences interpretations of

    eyewitness confidence statements: ‘The witness picked the suspect, they must be 100%

    sure’. Psychology, Crime and Law, 25(1), 50–68.

    https://doi.org/10.1080/1068316X.2018.1497167

    Wixted, J. T., Mickes, L., Dunn, J. C., Clark, S. E., & Wells, W. (2016). Estimating the reliability

    of eyewitness identifications from police lineups. Proceedings of the National Academy of

    Sciences, 113(2), 304–309. https://doi.org/10.1073/pnas.1516814112

Wixted, J. T., & Wells, G. L. (2017). The relationship between eyewitness confidence and identification accuracy: A new synthesis. Psychological Science in the Public Interest, 18(1), 10–65. https://doi.org/10.1177/1529100616686966

Part I: Investigating the influence of face recognition ability

    on the confidence-accuracy relationship in eyewitness

    memory.


    Study 1A: Predicting High Confidence Errors in Eyewitness Memory: The Role of Face

    Recognition Ability, Decision-Time, and Justifications (Grabman et al., 2019)

    How confident can we be about eyewitness confidence? A growing consensus suggests

    that identifications by highly confident witnesses are generally accurate (Wixted & Wells, 2017).

    However, the question is whether there are variables that systematically influence the accuracy of

    high confidence identifications. In the sections that follow we briefly review research on three

    factors that form the foundation of the first study: (a) the speed of a lineup identification, (b) the

    basis for an identification from a lineup, and (c) face recognition ability. We focus primarily on

    face recognition ability as no one (to our knowledge) has investigated the influence of this factor

    on high confidence misidentifications.

    Many studies find that lineup-identification accuracy worsens as decision-times increase

    when individuals choose a face from a lineup, though this association is weaker for non-

    identifications (e.g., Brewer & Wells, 2006; Dobolyi & Dodson, 2018; Dodson & Dobolyi, 2016;

    Dunning & Stern, 1994; Sauer, Brewer, Zweck, & Weber, 2010). But, growing evidence shows

    that high confidence errors also change as a function of the speed of lineup decisions. For

example, Sauerland and Sporer (2009) found that confident (90-100%) and fast (< 6s)

    identifications produced greater identification accuracy (97.1%) than confident, but slow,

    identifications (60.4%) (for similar results, see Brewer & Wells, 2006). Similarly, modeling

    decision-times continuously, Dodson and Dobolyi (2016) observed that accuracy greatly

    diminished for highly confident responses (100%) as decision-times increased. Taken together,

    these results suggest that, even under pristine lineup administration conditions, highly confident

    identifications may be reliable only insofar as the decision is made quickly.


    In addition to decision-time, highly confident eyewitnesses can differ in the basis for their

    identification of someone from a lineup. In the only study to examine this issue, Dobolyi and

    Dodson (2018) asked individuals to justify their level of confidence in a response to a lineup. A

    content analysis showed that nearly 50% of all lineup-identifications were justified by referring

    to a single or multiple observable features about the suspect (e.g., “I remember his eyes and

    nose”). Moreover, 20% of all identifications were accompanied by a reference to familiarity

    (e.g., “He’s familiar”), with the remaining identifications based on either an expression of

    recognition (e.g., “I recognize him”) or a reference to an unobservable feature (e.g., “He looks

    like my cousin”) or a mixture of these justification-types. For the present purposes, the key point

    is that high confidence misidentifications increased when identifications referenced familiarity as

    compared to the other justification types. However, the period between encoding and test was

short (5 minutes), meaning that it is unclear whether this relationship holds for longer delays.

    Finally, research conclusions about the confidence-accuracy relationship are currently

    based on and apply to the average individual. This focus on the average person, however,

    neglects individual differences which may account for some of the high-confidence errors that

    appear even when investigators follow proper procedures. The ability to recognize unfamiliar

    faces varies considerably from person to person (see Wilmer, 2017 for review). At the low end

    are those with prosopagnosia (‘face-blindness’), while other individuals exhibit exceptional skill

    (‘super-recognizers’) (Ramon, Bobak, & White, 2019; Russell, Yue, Nakayama, & Tootell, 2010;

    Wan et al., 2017). Face recognition ability is highly heritable (Wilmer et al., 2010; Zhu et al.,

    2010) and distinct from other cognitive markers such as verbal and visual recognition ability, and

    general intelligence (e.g., for reviews, see Wilmer, 2017; Wilmer et al., 2012).


    Although a few studies have shown that measures of face recognition predict eyewitness

    identification performance (Andersen, Carlson, Carlson, & Gronlund, 2014; Bindemann,

    Avetisyan, & Rakow, 2012; Morgan et al., 2007), no one has examined how heterogeneity in face

    recognition ability impacts the rate of high confidence misidentifications. One hypothesis about

    this relationship stems from Deffenbacher’s (1980) optimality account, which holds that

confidence will be a stronger predictor of accuracy under more, rather than less, ideal conditions of encoding, storage, and retrieval. By this account, face recognition ability should influence the

    quality (optimality) of what is encoded and retrieved, which in turn will influence the

    relationship between confidence and accuracy. In short, poor face recognizers should be more

    prone than strong face recognizers to make high confidence misidentifications. Alternatively,

    Semmler, Dunn, Mickes, and Wixted’s (2018) constant likelihood ratio account argues that,

    regardless of changes in overall accuracy, people assign confidence ratings so as to maintain the

    relationship between confidence and accuracy. Even though poor face recognizers will show

    worse accuracy than strong face recognizers, this account argues that there will be few changes

    in the predictive value of confidence – a high confidence identification will be comparably

    accurate across all levels of face recognition ability.

    In sum, the purpose of this study is to investigate factors that potentially increase the rate

    of high confidence misidentifications, namely (a) decision-time, (b) justifications, and (c) face

    recognition ability. We examine these variables in concert with two other forensically relevant

    factors: the other-race effect (e.g., Meissner & Brigham, 2001) and retention interval (Wixted,

    Read, & Lindsay, 2016).


    Methods

    Participants

    The study was administered online on respondents’ personal laptop or desktop computers

using Amazon’s Mechanical Turk (mTurk). The 569 participants included in the final analyses ranged in age from 18 to 50 years (M = 31.66, SD = 6.08), were primarily female (68.5%), and all self-reported their race as White/Caucasian. Though no consensus standards are available for a priori

    power estimates for mixed effects logistic regression models, this sample size was deemed

    sufficient in light of conservative recommendations of 50 responses per modeled variable (Van

    Der Ploeg, Austin, & Steyerberg, 2014), and findings that estimates are generally reliable for

    sample sizes greater than 30 with at least 10 responses per participant (McNeish & Stapleton,

    2016). All participants received payment for completing the study. The University of Virginia

    Institutional Review Board approved this research.

    Materials

    Lineups. Participants viewed the same six Black and six White lineups as used in Dobolyi

& Dodson (2013, 2018). Each lineup consisted of formal “head and shoulders” photographs of six individuals arranged in a 2 x 3 grid, each wearing a maroon-colored t-shirt and exhibiting a neutral facial expression (see Figure 1A.1 for an example). All lineups met the criterion that no face is

    substantially more likely to be chosen by a naïve viewer based on a description of the perpetrator

    (i.e. lineups were ‘fair’; see Dobolyi & Dodson, 2013 for more details on lineup generation). To

    avoid a simple picture-matching strategy, at encoding participants saw different photos of

    potential lineup targets wearing varied street clothing and casual expressions (e.g., ‘smiling’).


    Figure 1A.1. Example of the identification task. Participants’ task was to select the person from

    the encoding phase, or to indicate that they were “Not Present” in the lineup.

    Face Recognition Task. We administered the Cambridge Face Memory Test (CFMT)

    (Duchaine & Nakayama, 2006) to assess participants’ face recognition ability. In this task,

    respondents attempt to memorize six faces in three separate orientations. For each trial,

    previously viewed faces must be selected from an array of the target face and two foils. The test

    phase proceeds across 72 trials in three increasingly difficult blocks. Past research shows that a

    simple sum of correct responses is a reliable indicator of poor to above average recognition

    ability, with performance ranging from 0-72 correct selections (Cho et al., 2015). Figure 1A.2

    shows the distribution of CFMT scores from the present study.


Figure 1A.2. Distribution of CFMT scores for the 569 participants in the study. The blue line represents the median score (Median = 61), while the surrounding shaded area represents ± 1 median absolute deviation (MAD = 8.9).

    Procedure

    Procedurally, the study is similar to Dobolyi & Dodson (2018), except for two key

    differences. First, all participants completed the CFMT at the end of the lineup memory task.

    Second, we assigned roughly half of participants (n = 277) to a 5-minute delay between the

    encoding and test phases, while the remaining participants were tested a day later (n = 292).

    Prior to the encoding phase, we instructed participants that they would “see a series of faces.

    These faces will repeat 3 times. Please pay close attention because after a delay we will ask you

    questions about who you saw.” We further informed them that some participants would be

    randomly assigned to a 5-minute delay, whereas others would be prompted to return after a one-

    day delay. As an attention check, before showing the stimuli we asked, “how many times will the

    faces repeat?” Those responding anything other than ‘3’ were asked to reread the instructions.


    Failing this check a second time resulted in termination of study procedures (9 participants failed

    this check and are not included in the results or summary statistics).

    After passing the check, participants viewed six Black and six White faces as a block

    three times in a randomized order. This order followed the stipulations that: 1) The same face

    would not appear at the end of one block and begin the subsequent block (i.e., none would be

    shown ‘back to back’) and 2) faces of the same race would be shown a maximum of two

    consecutive times. Faces appeared for three seconds with a one second interstimulus interval.

    Additionally, to control for primacy and recency effects, four filler faces (two Black, two White)

    appeared at both the beginning and end of the encoding phase, but did not appear during the test

    phase.
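As a minimal rejection-sampling sketch in R of the block-ordering constraints described above (the data frame and helper function below are hypothetical illustrations, not the study's stimulus code):

# Hypothetical stimulus set: 6 Black and 6 White faces in one encoding block
faces <- data.frame(id = 1:12, race = rep(c("Black", "White"), each = 6))

# TRUE if an ordering shows at most two same-race faces in a row and does not
# begin with the face that ended the previous block (prev_last)
block_ok <- function(ord, prev_last = NA) {
  runs <- rle(faces$race[ord])$lengths
  all(runs <= 2) && (is.na(prev_last) || ord[1] != prev_last)
}

set.seed(1)
repeat {                      # rejection sampling: reshuffle until constraints hold
  ord <- sample(faces$id)
  if (block_ok(ord)) break
}
ord                           # one admissible presentation order for this block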

    Participants completed the lineup task after either five minutes of working on an online

    word search, or roughly one day later upon seeing the prompt to begin the next phase of the

    experiment (see Figure 1A.1 for an example of the task). We instructed them that they would see

    a series of lineups where a single face they viewed previously may or may not be present. Their

    task was either to identify the face they remembered from before, or to indicate that they did not

    recognize any of the faces in the lineup by selecting ‘not present’.

    After making their selection, we asked participants, “in their own words, [to] please

    explain how certain [they] are in [their] response” by typing into a text box. This was followed

    by a prompt to “please provide specific details about why” they made this expression of

    certainty. Finally, we asked them to indicate their confidence using a 6-point scale ranging from

0% (not at all certain) to 100% (completely certain) in 20-percentage-point increments.

    To check comprehension, and to demonstrate the task, we asked participants to pretend

    that they viewed a particular yellow smiley face. We then immediately presented a lineup of six


    colorful smiley faces. Only those who correctly selected the yellow smiley face proceeded to the

    test lineups, after reading “that previously viewed faces may look different in their lineup

    mugshots. This can be due to changes in lighting, clothing, facial hair, and/or other reasons” (33

    participants failed this check and are not included in the results or summary statistics).

    In the test phase, half of the lineups (3 Black, 3 White) contained an individual viewed during

    encoding (i.e. ‘target present’; TP), whereas the other half replaced this face with another person

    closely matched on descriptive characteristics (i.e. ‘target absent’; TA). Each lineup served as

    either a TP or TA lineup depending on its randomly assigned counterbalancing condition. One of

two predetermined lineup presentation orders was randomly assigned to each participant, with

    both following the criteria that 1) no more than two TP/TA lineups appeared consecutively, 2) no

    more than two lineups of the same race appeared consecutively, and 3) lineups appeared in

    different serial position across the two presentation orders. Finally, after finishing the lineups,

    participants completed the CFMT, followed by a short demographic survey that included

    questions on race, age, and sex.

    Results

    Data Preparation

The dataset comprises 7,248 lineup responses (12 lineups/participant x 604

    participants), and is available on the Open Science Framework (OSF) (https://osf.io/j25yc). We

    divided the data into six roughly equal-sized groups of participants, and assigned each group to

    two research assistants to code justifications for lineup responses. The coding scheme was nearly

    identical to Dobolyi & Dodson (2018), categorizing justifications based on familiarity (F; e.g.,

“he looks familiar.”), single observable feature (O1; e.g., “I remember his nose.”), multiple

    observable features (Omany; e.g., ‘I remember his nose and eyes.’), single unobservable feature


(U1; e.g., ‘he looks like my cousin.’), multiple unobservable features (Umany; e.g., ‘He looks like

    my cousin, and another guy I know.’), and recognition (R; e.g., ‘I recall seeing this guy before.’).

    However, whereas Dobolyi & Dodson (2018) assigned combinations of justification types into a

    general ‘mixed’ category, we coded these responses into categories representing either familiarity

    + observable (FO; e.g., ‘his nose looks familiar’), or observable + unobservable (OU; e.g., ‘my

    friend’s eyes look like that’). The coding scheme for ‘not present’ responses is the same as for

    identifications, except that statements referred to the absence of a justification category, such as

    ‘none of the faces look familiar’ (coded as F) or ‘I don’t recognize any of them’ (coded as R).

    Statements that did not fit any category were coded as unknown.

    Overall interrater agreement was high, with matching categorizations for 80.5% of lineup

    justifications. Across the pairs of raters, agreement ranged from 71.6% - 85.5%, with Cohen’s

    Kappas indicating acceptable agreement across coders (range Cohen’s κ = .66 - .83). To

    maximize the number of available responses, a third research assistant (masked to the other

    raters’ categorizations) coded statements where there was disagreement. We accepted any

    categorizations where at least two out of the three raters agreed on the statement. Due to the
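As a minimal illustration (with made-up labels, not study data) of how the percent agreement and Cohen's kappa reported here can be computed in R, the irr package's kappa2() is one option:

library(irr)  # kappa2() computes Cohen's kappa for two raters
rater1 <- c("F", "O1", "R", "FO", "F", "R")   # hypothetical justification codes
rater2 <- c("F", "O1", "R", "F",  "F", "R")
mean(rater1 == rater2)                        # raw percent agreement (here .83)
kappa2(data.frame(rater1, rater2))            # unweighted Cohen's kappa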

    cross-race manipulation, we removed 20 participants who did not self-report their race as

    White/Caucasian. Additionally, we removed 15 participants based on not providing any

    justifications (N = 1), giving the same justification for all 12 lineups (e.g., “it was the same face

    as before”; N = 11), or providing nonsensical answers (e.g., “they’re all white guys wearing the

    same t-shirt”; N = 3).

    As we planned on investigating decision-times in several analyses, we log transformed

    decision-times for each lineup, and calculated a median absolute deviation score. We removed

decision-times shorter than 100 ms (n = 14 responses), as well as responses longer than 3


    deviations above the median (roughly one minute) (n = 183 responses). We then eliminated

    responses where justifications could not be categorized (n = 845 responses). We also observed

    minimal numbers of OU (n = 27 responses) and Umany (n = 8 responses) categorizations,

    therefore we did not analyze these trials. Finally, we noticed many respondents mentioned that

    one of the Black target faces resembled a celebrity in the news during the experiment. Given that

    the study aims to examine responses to unfamiliar faces, this would be a major confound, and we

    removed responses to this lineup (n = 491 responses). In total, we examined 5,272 responses

    from 569 participants.
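A minimal sketch of this screening step in R, assuming a response-level data frame with a decision-time column in seconds (the column name and toy values are illustrative, and R's mad() applies its default scaling constant, which may differ from the exact procedure used here):

library(dplyr)
# Toy stand-in for the response-level data; decision times are in seconds
responses <- data.frame(decision_time_s = c(0.05, 1.8, 3.4, 8.2, 95.0))
cleaned <- responses %>%
  mutate(log_dt = log(decision_time_s)) %>%
  filter(decision_time_s >= 0.100,                    # drop responses faster than 100 ms
         log_dt <= median(log_dt) + 3 * mad(log_dt))  # drop slow outliers (roughly > 1 minute)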

    Table 1A.1 provides a breakdown of the frequency of justifications across confidence

    levels for chooser responses (i.e., selecting a face from the TP or TA lineup) and non-chooser

    responses (i.e., responding ‘not present’). Justifications for chooser decisions most frequently

    referenced one or more observable features, either in the context of familiarity with these

    features (FO = 10.7%), or otherwise (O1 + Omany = 31.7%). In contrast, non-chooser decisions

    most commonly referred to not recognizing any faces in the lineup (R = 65.1%) or that faces

    were unfamiliar (F = 31.9%).

    We analyzed chooser responses and non-chooser responses with separate models because

    the infrequent use of many of the justification-types for non-chooser responses meant that it was

    impracticable to use the same model for both response-types. For each model of the ‘chooser’

    and ‘non-chooser’ data, we used multi-model comparisons (Burnham & Anderson, 2002) to

    obtain the best generalized linear mixed effects model among the fixed factors: Justification

    Type, Lineup Race (Same Race, Other Race), Delay (5 minute, Day), Confidence, Decision-time

    and CFMT score. Participant ID served as a random intercept. Continuous predictors

    (confidence, decision-time, CFMT) were centered and scaled prior to model fitting.


                                                Confidence
Response      Lineup Race   Justification     0    20    40    60    80   100   Total
Chooser       Same Race     F                14    92    90    86    49    14     345
                            FO                7    42    53    49    25     6     182
                            O1                2    31    47    55    80    68     283
                            Omany             1     7    23    45    55    42     173
                            R                13    60    66    87    80   100     406
                            U1                0     3     8    21    22    35      89
              Other Race    F                13    97    88    71    56    10     335
                            FO                2    28    26    32    18     6     112
                            O1                1    22    41    56    53    58     231
                            Omany             2    14    28    49    41    50     184
                            R                10    48    59    66    66    95     344
                            U1                0     5     5     9    18    26      63
              Total                          65   449   534   626   563   510    2747
Non-Chooser   Same Race     F                31    78    84   109   109    39     450
                            FO                1     1     1     3     1     0       7
                            O1                0     4     2     3     5     3      17
                            Omany             0     1     0     4     4     2      11
                            R                51   118   170   220   230   126     915
                            U1                0     1     0     1     1     0       3
              Other Race    F                24    39    82    99    79    33     356
                            FO                0     0     3     0     2     2       7
                            O1                0     1     2     8     4     6      21
                            Omany             0     0     1     0     3     1       5
                            R                73   109   120   176   168    83     729
                            U1                0     0     0     1     2     1       4
              Total                         180   352   465   624   608   296    2525

Table 1A.1. Frequency of responses in the intersection of lineup race, justification type, and confidence level for both Chooser and Non-Chooser decisions.


We began by fitting full 6-way, 5-way, 4-way, 3-way, 2-way, and main-effects

    models using the lme4 package (Bates, Maechler, Bolker, & Walker, 2014, version 1.1-21) in R

    v.3.5.1 (R Core Team, 2018). Next, a backward stepwise elimination procedure based on

    Akaike’s Information Criterion (AIC) selected the most parsimonious model from each start

point. This method removed model terms whenever their removal improved AIC, so long as this did not violate principles of marginality (e.g., a two-way term could not be dropped if it was

    nested in a higher three-way term). We then selected the best fitting of these reduced models as

    determined by AIC. Significance testing was performed on final model terms using likelihood

    ratio tests calculated by the afex package (Singmann, Bolker, Westfall, & Aust, 2018, version

    0.21-2). The effects package (Fox, 2003, version 4.0-2) computed model estimates and 95%

    confidence intervals.
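As a rough sketch of this pipeline (with simulated data and a deliberately reduced two-way starting model in place of the full 6-way model), the core calls might look as follows; the single drop1() shown stands in for the repeated drop-and-refit cycle of the backward elimination:

library(lme4)     # glmer() for binomial generalized linear mixed models
library(afex)     # mixed() reports likelihood-ratio tests for model terms
library(effects)  # Effect() computes model estimates and 95% confidence intervals

set.seed(1)  # simulated stand-in data (one row per lineup response)
d <- data.frame(
  accuracy     = rbinom(600, 1, 0.5),
  Confidence   = as.numeric(scale(sample(seq(0, 100, 20), 600, replace = TRUE))),
  CFMT         = as.numeric(scale(sample(30:72, 600, replace = TRUE))),
  DecisionTime = as.numeric(scale(rlnorm(600))),
  Participant  = factor(rep(1:50, each = 12))
)

m <- glmer(accuracy ~ (Confidence + CFMT + DecisionTime)^2 + (1 | Participant),
           data = d, family = binomial)

drop1(m)  # AIC change for each term whose removal respects marginality; the backward
          # procedure repeats drop1()/update() until no removal improves AIC

mixed(accuracy ~ Confidence * CFMT + (1 | Participant),  # likelihood-ratio tests
      data = d, family = binomial, method = "LRT")       # on the retained terms

Effect(c("Confidence", "CFMT"), m)  # estimates and 95% CIs for plotting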

    Finally, while there are no consensus standards for assessing absolute fits for generalized

    linear mixed effects models, we examined fits for final models using three methods. First, we

    used the DHARMa package (Hartig, 2018, version 0.2.0) to perform Kolmogorov-Smirnov

    goodness-of-fit tests (KS tests), comparing the observed data to a cumulative distribution of

    1,000 simulations from model estimates. Second, we examined residual plots based on

    deviations between simulated and observed values to check for signs of model misspecification

    (i.e., ensuring errors are uniformly distributed for each predicted value). And third, we calculated

    marginal pseudo-R2 (R2GLMM(m)) for fixed-effects, using the MuMIn package (Barton, 2018,

    version 1.42.1; see also Nakagawa & Schielzeth, 2013). This statistic includes variance

    accounted for by fixed effects in the model, while partialing out variance from the random effect

    structure (i.e., participant intercept).
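Continuing the sketch above with the fitted model m, the three absolute-fit checks described here could be run roughly as follows (exact options are illustrative):

library(DHARMa)  # simulation-based residual diagnostics
library(MuMIn)   # r.squaredGLMM() for marginal and conditional pseudo-R2

sim <- simulateResiduals(m, n = 1000)  # 1,000 simulations from the fitted model
testUniformity(sim)                    # Kolmogorov-Smirnov goodness-of-fit test
plot(sim)                              # residual plots to check for misspecification
r.squaredGLMM(m)                       # marginal (fixed-effects) column = pseudo-R2GLMM(m)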


    Chooser model.

    We sought to include as much data as possible in the analysis of identification accuracy

    and so, following Dobolyi and Dodson (2018), we modeled this score as the rate of correct

    identifications from target-present lineups (TPc) relative to the sum of this score and the rates of

    foil identifications from target-present (TPfa) and target-absent (TAfa) lineups (i.e.,

    TPc/[TPc+TPfa+TAfa]).
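For concreteness, a small worked example with hypothetical counts (not values from this study):

TPc  <- 60  # correct identifications from target-present lineups
TPfa <- 25  # foil identifications from target-present lineups
TAfa <- 35  # foil identifications from target-absent lineups
TPc / (TPc + TPfa + TAfa)  # identification accuracy = 60 / 120 = .50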

Written in Wilkinson-Rogers (1973) notation, the best-fitting model of identification

    accuracy consists of several main effects and two-way interactions: Accuracy ~ LineupRace +

    Confidence + Delay + DecisionTime + CFMT + Justification + Confidence:LineupRace +

    Confidence:Delay + Confidence:DecisionTime + Confidence:CFMT + Confidence:Justification

    + DecisionTime:CFMT + DecisionTime:Justification + CFMT:Justification + (1|Participant). The

    absolute fit indices indicate that this model adequately fit the data (KS D = .017, p = .410;

    pseudo-R2GLMM(m) = .365), as did visual inspection of the residual plots.

    Likelihood ratio tests showed significant main effects of lineup-race, χ2(1) = 6.08, p

    = .014, delay, χ2(1) = 11.75, p = .001, confidence, χ2(1) = 20.20, p < .001, face-recognition

    ability (i.e., CFMT score), χ2(1) = 20.96, p < .001, and justification-type, χ2(1) = 14.49, p = .013.

    The effect of delay reflects higher accuracy in the 5-minute (44.4%, 95% CI [39.6, 49.2])

    compared to the one-day condition (33.4%, 95% CI [29.4, 37.7]). Other significant effects were

    all moderated by two-way interactions, which we describe below. The main effect of Decision-

time (p = .294) and the interactions between Confidence and Delay (p = .096), Decision-time

    and CFMT (p = .155), and CFMT and Justification (p = .054) are non-significant. The four

    panels in Figure 1A.3 show how identification accuracy changes as a function of both the

    participant’s level of confidence in their identification and (a) their face recognition ability


    (CFMT score), (b) their decision-time, (c) the lineup-race and (d) the justification for their

    decision, respectively. In each of these figures, the lines represent the mixed-effects model’s

    estimates, with the shading representing the 95% confidence interval.

Figure 1A.3. Two-way interactions between Confidence and (A) CFMT, (B) Decision-time, (C)

    Lineup Race, and (D) Justification type in the chooser model. Lines represent model estimates,

    with error shading representing the 95% confidence interval. Notably, high confidence errors are

    more pronounced when participants are worse face recognizers (A), take longer to make a

    decision (B), and/or use F/FO as the basis for selecting a face (D).

    Figure 1A.3a shows the interaction between face recognition ability (CFMT score) and

    confidence, χ2(1) = 4.54, p = .033. Poor face recognizers (i.e., individuals with lower CFMT

    scores) are less able than strong face recognizers to use confidence ratings to distinguish between


    correct and incorrect identifications. But, the result that we want to emphasize involves high

    confidence responses. Figure 1A.3a clearly shows that when individuals are 100% confident in

    their identification there is a drop-off in accuracy with steadily decreasing CFMT scores. Poor

    face recognizers are much more prone to make high confidence misidentifications than are

    strong face recognizers.

    Figure 1A.3b shows that relatively fast and highly confident identifications are more

    accurate than slower and less confident identifications, replicating past research (Dodson &

    Dobolyi, 2016; Sauerland & Sporer, 2007, 2009). But, the interaction between Decision-time and

    Confidence, χ2(1) = 17.48, p < .001, reflects the strong increase in high confidence errors that

    occurs with longer decision times. Although the highest confidence responses (i.e., the solid red

    line in Figure 1A.3b) are close to 100% accurate when they occur within a few seconds, the

    accuracy of these highest confidence identifications decreases to roughly 50% when decision-

    time is delayed to 20s. There is no comparable drop off in accuracy with increasing decision-

    time for moderate to low confidence responses. Essentially, highly confident but slow

    identifications are vulnerable to being wrong.

    The interaction between confidence and lineup-race is shown in Figure 1A.3c, χ2(1) =

6.12, p = .013. The cross-race deficit in identification accuracy is larger when individuals express moderate to low confidence in their identification than when they are highly confident – an effect that is consistent with past studies (e.g., Dodson & Dobolyi, 2016; Nguyen, Pezdek, & Wixted, 2017; Wixted & Wells, 2017). Put another way, highly confident identifications are

    less influenced by the cross-race effect.

    Figure 1A.3d shows that identification accuracy depends on both confidence and the

    justification for the identification, as reflected by the interaction between these factors, χ2(5) =


    28.14, p < .001. Consistent with Dobolyi & Dodson (2018), there is a stronger relationship

    between confidence and accuracy –shown by a steeper line in Figure 1A.3d – when individuals

    refer to observable (O1 + Omany; e.g., I remember his eyes) or unobservable (U1; e.g., He looks

    like my cousin) features about the suspect than when they refer to familiarity (F; e.g., He’s

    familiar). Moreover, there are more high confidence errors when individuals provide a

    familiarity (F) or a familiarity-observable justification (FO, e.g., His chin is familiar) than when

    they provide any of the other justification-types.

    Finally, Figure 1A.4 shows that the predictive value of the different justification-types is

    stronger at faster than at slower decision-times, as reflected by the interaction between decision-

    time and justification-type, χ2(5) = 12.01, p = .035. For clarity, we removed the Unobservable

    (U1) category from the figure because of the lack of data at the longer decision-times for this

    justification. References to many observable features (Omany) are associated with identifications

    that are over 80% accurate when the identification is made quickly. But, as seen in Figure 1A.4,

    the accuracy associated with this justification-type drops below 40% when this identification is

    made slowly (> 10 s).

Figure 1A.4. Interaction pattern between Decision-time and Justification type. Lines represent model estimates, with error shading representing the 95% confidence interval. Justification type appears more useful for discerning accuracy for fast responses than for slow responses, where there is little differentiation between the justification types.


    Non-Chooser model.

    Non-chooser accuracy is modeled as the

    rate of correct rejections from target-absent

    lineups (TAc), relative to the sum of this score

    and the number of incorrect rejections from

    target-present lineups (‘miss’; TPm) (i.e., (i.e.,

    TAc/[TAc+TPm]). As shown in Table 1A.1,

    nearly all justifications (97.0%) for a Not Present

    response were based on the lack of either

    Familiarity (F) or Recognition (R), consistent

    with Dobolyi & Dodson (2018). Consequently,

    our modeling analysis consisted of these two

    justification-types as there is too little data to

    include the other justification-types.

    The best-fitting model of non-chooser

accuracy is represented in Wilkinson-Rogers

    notation as: Accuracy ~ LineupRace +

    Confidence + Delay + DecisionTime + CFMT +

    Justification + Confidence:CFMT +

    DecisionTime:CFMT + (1|Participant). Visual

    inspection of the residual plots and KS tests

    showed that this model fit the data (KS D = .014,

p = .758). However, the marginal pseudo-R2 was considerably lower than in the Chooser model (pseudo-R2GLMM(m) = .019). Given that our relative fit measure (i.e., AIC) and two out of three absolute fit indices supported proper model specification, we proceeded with this non-chooser model.

Figure 1A.5. (A) Confidence and (B) CFMT main effects on non-chooser accuracy. Lines represent model estimates, with error shading representing the 95% confidence interval. Notably, performance improves with higher levels of confidence and greater face recognition ability.

    We found the expected relationship between delay and accuracy, with participants

    exhibiting higher accuracy in the 5-minute condition (66.5%, 95% CI [63.7, 69.1]) than the one-

day condition (62.2%, 95% CI [59.4, 64.9]), χ2(1) = 4.78, p = .029.

    Additionally, non-chooser

    accuracy improved as participants

    expressed more Confidence, χ2(1) =

    18.20, p < .001. As presented in

    Figure 1A.5, accuracy steadily rises

    as confidence increases, improving

    by nearly 15% from 0% to 100%

    confidence. This finding conflicts

    with multiple previous studies

    examining confidence and non-

    chooser accuracy (e.g., Dobolyi &

    Dodson, 2018; Sauerland & Sporer,

    2009). We speculate on the reasons

    for this discrepancy in the Study 1A

    Discussion.

Figure 1A.6. Two-way interaction between decision-time

    and CFMT score. Lines represent model estimates for

    the 0-25th, 25-50th, 50-75th, and >75th percentiles of

    CFMT performance. Error shading represents the 95%

    confidence interval. Performance is comparable across

    face recognition ability for fast decisions, but poor face

    recognizers show worse accuracy over time.


    The main effect of CFMT, χ2(1) = 10.30, p = .001, reflects improved non-chooser

    accuracy with stronger face recognition ability. As shown in Figure 1A.5, those with the median

    CFMT score (i.e., 61) show worse non-chooser performance (~65%) than do those with scores

    only one median deviation higher (i.e., 70) (~68%). However, this finding is qualified by a weak

    interaction between face recognition ability and decision-time, χ2(1) = 4.58, p = .032. This

    interaction suggests that performance is comparable across face recognition ability for quick

    decisions, but poorer recognizers show worse accuracy with increasing decision-time (see Figure

    1A.6).

    Finally, we found a significant main effect of justification category, χ2(1) = 4.41, p = .036.

    Familiarity-based rejections (67.3%, 95% CI [63.9, 70.4]) were more accurate than were those

    based on recognition (62.9%, 95% CI [60.5, 65.2]), although numerically the size of this

    difference is small. The main effect of decision-time (p = .137) and the interaction between

    confidence and CFMT (p = .091) are both non-significant.

    Suspect-Id Model

    Mickes (2015; see also Wixted & Wells, 2017) has argued that identification accuracy

    should be measured as the rate of correct identifications relative to the sum of this value and foil

    identifications from target-absent lineups – a score known as suspect ID accuracy (i.e.,

    TPc/[TPc+(TAfa/6)] for fair lineups). The reason why responses to foils from target-present

    lineups (TPfa) are excluded in suspect-ID accuracy is because police know that target-present

foils are innocent individuals. Thus, suspect-ID accuracy mirrors the perspective of law

    enforcement: given that an individual has been identified, what is the probability that this


    individual is the guilty suspect (i.e., TPc) and not an innocent suspect (i.e., TAfa/6 with fair

    lineups).

    Because our modeling procedure does not allow for the suspect-Id adjustment without a

    substantial loss of TAfa responses (e.g., removal of 5/6 of the false alarm responses), we

    analyzed a quasi-suspect-Id accuracy score: the ratio of correct responses to target present

    lineups [i.e., TPc] over the sum of TPc and false alarms to target absent lineups [i.e. TPc/(TPc +

    TAfa)].
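For concreteness, a small worked example with the same hypothetical counts as the earlier worked example:

TPc  <- 60  # correct identifications from target-present lineups
TAfa <- 35  # foil identifications from target-absent lineups
TPc / (TPc + TAfa / 6)  # suspect-ID accuracy for fair six-person lineups, about .91
TPc / (TPc + TAfa)      # quasi-suspect-ID score analyzed here, about .63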

    We examined suspect-Id accuracy using the same backward stepwise procedure detailed

above. Written in Wilkinson-Rogers notation, the best-fitting model of suspect-

    Id accuracy consists of several main effects and two-way interactions: Accuracy ~ LineupRace +

    Confidence + Delay + DecisionTime + CFMT + Justification + LineupRace:Confidence +

    Confidence:DecisionTime + Confidence:CFMT + Confidence:Justification +

    DecisionTime:CFMT + DecisionTime:Justification + (1|Participant). Both computed absolute fit

    indices supported that this model adequately explained the data (KS D = .013, p = .812, pseudo-

    R2GLMM(m) = .353), as did visual inspection of the residual plots.

    Likelihood ratio tests showed comparable patterns to the identification accuracy model.

    There were significant main effects of lineup-race, χ2(1) = 4.42, p = .036, delay, χ2(1) = 6.07, p

    = .014, confidence, χ2(1) = 16.04, p < .001, CFMT, χ2(1) = 32.39, p < .001, and justification-

    type, χ2(5) = 14.07, p = .015. As expected, the main effect of delay reflects better accuracy in the

    5-minute (56.8%, 95% CI [52.3, 61.1]) than the 1-day (49.2%, 95% CI [44.9, 53.6]) condition.

Crucially, we highlight the similar interaction patterns between confidence and (a)

    CFMT, χ2(1) = 3.13, p = .077, (b) decision-time, χ2(1) = 12.92, p < .001, (c) lineup-race, χ2(1) =

    4.08, p = .043, and (d) justification-type, χ2(5) = 24.37, p < .001. As seen in Figure 1A.7a-d,


    these suspect-Id results are consistent with the identification accuracy model. Specifically, high

    confidence is associated with more errors for (a) poor face recognizers, (b) slower decision

    times, and (d) F/FO justifications, but also diminished other-race effects (c). All other effects are

    non-significant (ps > .071).

Figure 1A.7. Suspect-ID interactions between Confidence and (A) CFMT, (B) Decision-time, (C) Lineup Race, and (D) Justification-type. Lines represent model estimates, with error shading representing the 95% confidence interval.


    Study 1A Discussion

    Recent research suggests that high confidence eyewitness identifications are generally

    reliable (Wixted & Wells, 2017). Our study adds important caveats to this assessment. We

    document three factors that are systematically related to high confidence misidentifications: (a)

    the speed of the decision, (b) the basis for an identification from a lineup, and (c) face

    recognition ability.

    Decision-time is strongly related to high confidence misidentifications. Consistent with

    past studies (e.g., Brewer & Wells, 2006; Dodson & Dobolyi, 2016; Sauerland & Sporer, 2007,

    2009), we observed that fast and confident identifications – presented in Figure 1A.3b -- are

    many times more accurate than fast and unconfident identifications. But, the key point is that

    there is a sharp increase in high confidence errors with longer decision times. Whereas highest

    confidence (100%) identifications made in the initial seconds are nearly always accurate, these

    identifications fall to nearly 75% accuracy when decision-time increases to 6 seconds and after

    20 seconds these reports are roughly 50% accurate (see Brewer & Wells, 2006; Sauerland &

    Sporer, 2009 for a similar pattern). As Dodson and Dobolyi (2016) suggest, participants appear

    to adopt an increasingly liberal criterion for making high confidence identifications with

    increasing decision-time – causing an increase in high confidence errors.

    Additionally, consistent with Dobolyi & Dodson (2018), familiarity justifications are

    more frequently associated with high confidence misidentifications than are justifications that

    refer to either an expression of recognition, or (un)observable feature(s) about the suspect.

    Moreover, this relationship persisted across a longer delay than previously studied, and after

    accounting for the effects of face recognition ability. With both the Department of Justice (Yates,

    2017) and the National Academy of Sciences (National Research Council, 2014) advising law


    enforcement to note the exact wording of an eyewitness’s identification, our finding provides

    investigators with an additional layer of information with which to assess witness credibility.

    Finally, for the first time, we show that the Cambridge Face Memory Test predicts the

    likely accuracy of high confidence identifications. Poor face recognizers are much more

    vulnerable than strong face recognizers to make high confidence misidentifications. Even when

    individuals are 100% confident, Figure 1A.3a shows that the average face recognizer (i.e.,

    median CFMT score of 61) is much more likely than the strongest face recognizers (i.e., CFMT

    score of 72) to make a high confidence misidentification – with below-average face recognizers

    even more vulnerable to making high confidence errors.

    This finding supports the ‘optimality’ account, wherein the predictive value of a

    confidence statement is directly tied to the quality of the face representation (Deffenbacher,

    1980). As poorer face recognizers encode less robust representations of target faces, high

    confidence is a less reliable indicator of accuracy than for better recognizers. However, as a

    counterpoint to the optimality account, many studies find that eyewitnesses adjust their use of

    high confidence ratings to maintain impressive levels of accuracy in non-ideal encoding

    conditions, such as lengthy retention intervals, and increased viewing distances (Semmler et al.,

    2018; Wixted & Wells, 2017). Further research will be necessary to disentangle these accounts,

    especially studies incorporating measures of individual differences.

    An additional question that needs further clarification is why poor face recognizers use

    high confidence ratings for (presumably) weak face representations. As the present experiment

    was not designed to answer this question, we can only speculate. However, a large body of

    literature shows that people can severely overestimate their competence when they perform

    poorly on a task, and correspondingly exhibit overconfidence (e.g., Kruger & Dunning, 1999;


Lichtenstein & Fischhoff, 1977). These errors occur most frequently in content areas where people lack knowledge and/or receive minimal feedback on performance. Although it seems like there

    should be consistent feedback on face recognition ability (e.g., embarrassingly introducing

    oneself to a person met the night before), there is an ongoing debate about the degree to which

people have insight into their face recognition ability (Bobak, Mileva, & Hancock, 2018; Gray, Bird, & Cook, 2017). It is conceivable that poor recognizers underestimate the extent of

    their deficiency, and/or place undue emphasis on non-diagnostic memory signals.

    With respect to non-identifications, we highlight two factors that were related to the

    accuracy of a “not present” response. First, stronger face recognizers (i.e., higher CFMT scores)

    were more accurate at correctly rejecting lineups than were poorer face recognizers, presumably

    because their more robust representations of previously seen faces allowed them to recognize

    when a target individual was absent from a lineup.

    Second, contrary to research that has observed little relationship between confidence and

    non-chooser accuracy (e.g., Dodson & Dobolyi, 2016; Sauerland & Sporer, 2009), we found that

    confidence in non-chooser decisions was informative, such that highly confident rejections were

    more often correct than were low confidence rejections. But, consistent with previous findings,

    confidence is a stronger predictor of chooser accuracy than non-chooser accuracy (e.g., Brewer

    & Wells, 2006). We believe that the conflicting findings about confidence and non-chooser

    accuracy between this study and previous work stems from our decision to model chooser and

    non-chooser responses separately. To illustrate this point, we followed past studies and

    constructed a single model of chooser and non-chooser accuracy and found that confidence did

    not significantly predict non-chooser accuracy. However, there are qualitative differences

    between chooser and non-chooser decisions, as evidenced by changes in the relative use of


    justification categories, which suggests individuals may adjust how they use the confidence scale

    in these two situations. Reinforcing the impact of the modeling procedure, Wixted and Wells

    (2017) isolated non-chooser responses from a dataset provided by Wetmore et al. (2015), and

    similarly found that high confidence rejections were more accurate than were those made with

    lower confidence.

    In sum, existing research on eyewitness identification has focused on the average

    individual and has shown that a participant’s confidence rating about an identification is

    informative of its accuracy (Wixted & Wells, 2017). We show that high confidence

identifications do not protect against the increase in errors that accompanies poorer face recognition ability, increasing decision-time, or the use of familiarity as a justification for a

    response. Taken together, this study suggests that the justice system should take both individual

    differences and confidence into account when determining the likely accuracy of an eyewitness

    decision.


    Study 1A References

    Andersen, S. M., Carlson, C. A., Carlson, M. A., & Gronlund, S. D. (2014). Individual

    differences predict eyewitness identification performance. Personality and Individual

    Differences, 60, 36-40.

    Barton, K. (2018) MuMIn: Multi-model inference. R package version 1.42.1. https://CRAN.R-

    project.org/package=MuMIn

    Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014). lme4: Linear mixed-effects models

    using Eigen and S4. R package version 1.1-21.

    Bindemann, M., Brown, C., Koyas, T., & Russ, A. (2012). Individual differences in face

    identification postdict eyewitness accuracy. Journal of Applied Research in Memory and

    Cognition, 1(2), 96-103.

    Bobak, A. K., Mileva, V. R., & Hancock, P. J. (2018). Facing the facts: Naive participants have

    only moderate insight into their face recognition and face perception abilities. Quarterly

    Journal of Experimental Psychology, https://doi.org/10.1177/1747021818776145.

    Brewer, N., & Wells, G. L. (2006). The confidence-accuracy relationship in eyewitness

    identification: effects of lineup instructions, foil similarity, and target-absent base rates.

    Journal of Experimental Psychology: Applied, 12(1), 11-30.

    Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A

    practical information-theoretic approach (2nd ed.). New York, NY: Springer-Verlag.

    Cho, S. J., Wilmer, J., Herzmann, G., McGugin, R. W., Fiset, D., Van Gulick, A. E., ... &

    Gauthier, I. (2015). Item response theory analyses of the Cambridge Face Memory Test

    (CFMT). Psychological assessment, 27(2), 552-566.

    Deffenbacher, K. A. (1980). Eyewitness accuracy and confidence: Can we infer anything about

    their relationship?. Law and Human Behavior, 4(4), 243-260.

    De Lestrade, J. X. (2001). Murder on a Sunday Morning. Docurama.

    Dobolyi, D. G., & Dodson, C. S. (2013). Eyewitness confidence in simultaneous and sequential

    lineups: A criterion shift account for sequential mistaken identification overconfidence.

    Journal of Experimental Psychology: Applied, 19(4), 345-357.

    Dobolyi, D. G., & Dodson, C. S. (2018). Actual vs. perceived eyewitness accuracy and

    confidence and the featural justification effect. Journal of Experimental Psychology:

    Applied. Advance online publication. http://dx.doi.org/10.1037/xap0000182

    Dodson, C. S., & Dobolyi, D. G. (2016). Confidence and Eyewitness Identifications: The Cross-

    Race Effect, Decision Time and Accuracy. Applied Cognitive Psychology, 30(1), 113-

    125.

    Duchaine, B., & Nakayama, K. (2006). The Cambridge Face Memory Test: Results for

    neurologically intact individuals and an investigation of its validity using inverted face

    stimuli and prosopagnosic participants. Neuropsychologia, 44(4), 576-585.

    Dunning, D., & Stern, L. B. (1994). Distinguishing accurate from inaccurate eyewitness

    identifications via inquiries about decision processes. Journal of Personality and Social

    Psychology, 67(5), 818.

Fox, J. (2003). Effect displays in R for generalized linear models. Journal of Statistical

    Software, 8(15), 1-27.

    Gray, K. L., Bird, G., & Cook, R. (2017). Robust associations between the 20-item

    prosopagnosia index and the Cambridge Face Memory Test in the general population.

    Royal Society open science, 4(3), https://doi.org/10.1098/rsos.160923.


Hartig, F. (2018). DHARMa: Residual diagnostics for hierarchical (multi-level/mixed) regression

    models. R package version 0.2.0.

    Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: how difficulties in recognizing

    one's own incompetence lead to inflated self-assessments. Journal of personality and

    social psychology, 77(6), 1121-1134.

    Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20(2), 159–183. doi:10.1016/0030-5073(77)90001-0

    McNeish, D. M., & Stapleton, L. M. (2016). The effect of small sample size on two-level model

    estimates: A review and illustration. Educational Psychology Review, 28(2), 295-314.

    Meissner, C. A., & Brigham, J. C. (2001). Thirty years of investigating the own-race bias in

    memory for faces: A meta-analytic review. Psychology, Public Policy, and Law, 7(1), 3-

    35.

    Mickes, L. (2015). Receiver operating characteristic analysis and confidence–accuracy

    characteristic analysis in investigations of system variables and estimator variables that

    affect eyewitness memory. Journal of Applied Research in Memory and Cognition, 4(2),

    93-102.

    Morgan III, C. A., Hazlett, G., Baranoski, M., Doran, A., Southwick, S., & Loftus, E. (2007).

    Accuracy of eyewitness identification is significantly associated with performance on a

    standardized test of face recognition. International Journal of Law and Psychiatry, 30(3),

    213-223.

    Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R2 from

    generalized linear mixed‐effects models. Methods in Ecology and Evolution, 4(2), 133-

    142.

    National Research Council. (2014). Identifying the culprit: Assessing eyewitness identification.

    Washington, DC: The National Academies Press.

    Nguyen, T. B., Pezdek, K., & Wixted, J. T. (2017). Evidence for a confidence–accuracy

    relationship in memory for same-and cross-race faces. The Quarterly Journal of

    Experimental Psychology, 70(12), 2518-2534.

    Russell, R., Duchaine, B., & Nakayama, K. (2009). Super-recognizers: People with extraordinary

    face recognition ability. Psychonomic bulletin & review, 16(2), 252-257.

    Sauer, J., Brewer, N., Zweck, T., & Weber, N. (2010). The effect of retention interval on the

    confidence–accuracy relationship for eyewitness identification. Law and Human

    Behavior, 34(4), 337-347.

    Sauerland, M., & Sporer, S. L. (2007). Post-decision confidence, decision time, and self-reported

    decision processes as postdictors of identification accuracy. Psychology, Crime & Law,

    13(6), 611-625.

    Sauerland, M., & Sporer, S. L. (2009). Fast and confident: Postdicting eyewitness identification

    accuracy in a field study. Journal of Experimental Psychology: Applied, 15(1), 46-62.

    Semmler, C., Dunn, J., Mickes, L., & Wixted, J. T. (2018). The role of estimator variables in eyewitness identification. Journal of Experimental Psychology: Applied, 24(3), 400-415.

    Singmann, H., Bolker, B., Westfall, J., & Aust, F. (2018). afex: Analysis of factorial experiments.

    R package version 0.21-2.


    Wan, L., Crookes, K., Dawel, A., Pidcock, M., Hall, A., & McKone, E. (2017). Face-blind for

    other-race faces: Individual differences in other-race recognition impairments. Journal of

    Experimental Psychology: General, 146(1), 102.

    Wetmore, S. A., Neuschatz, J. S., Gronlund, S. D., Wooten, A., Goodsell, C. A., & Carlson, C. A.

    (2015). Effect of retention interval on showup and lineup performance. Journal of

    Applied Research in Memory and Cognition, 4(1), 8-14.

    Wilkinson, G. N., & Rogers, C. E. (1973). Symbolic Description of Factorial Models for

Analysis of Variance. Applied Statistics, 22, 392-399. https://doi.org/10.2307/2346786

    Wilmer, J. B. (2017). Individual differences in face recognition: A decade of discovery. Current

    Directions in Psychological Science, 26(3), 225-230.

    Wilmer, J. B., Germine, L., Chabris, C. F., Chatterjee, G., Gerbasi, M., & Nakayama, K. (2012).

    Capturing specific abilities as a window into human individuality: The example of face

    recognition. Cognitive Neuropsychology, 29(5-6), 360-392.

    Wilmer, J. B., Germine, L., Chabris, C. F., Chatterjee, G., Williams, M., Loken, E., ... &

    Duchaine, B. (2010). Human face recognition ability is specific and highly heritable.

Proceedings of the National Academy of Sciences, 107(11), 5238-5241.

    Wixted, J. T., Mickes, L., Dunn, J. C., Clark, S. E., & Wells, W. (2016). Estimating the reliability

    of eyewitness identifications from police lineups. Proceedings of the National Academy

    of Sciences, 113(2), 304-309.

    Wixted, J. T., Read, J. D., & Lindsay, D. S. (2016). The effect of retention interval on the

    eyewitness identification confidence–accuracy relationship. Journal of Applied Research

    in Memory and Cognition, 5(2), 192-203.

    Wixted, J. T., & Wells, G. L. (2017). The relationship between eyewitness confidence and

    identification accuracy: A new synthesis. Psychological Science in the Public Interest,

    18(1), 10-65.

    van der Ploeg, T., Austin, P. C., & Steyerberg, E. W. (2014). Modern modelling techniques are

data hungry: A simulation study for predicting dichotomous endpoints. BMC Medical

    Research Methodology, 14(1), 137.

    Yates, S.Q. (2017, Jan 6). Memorandum for heads of department law enforcement components

    all department prosecutors. Subject: Eyewitness identification: Procedures for conducting

    photo arrays. https://www.justice.gov/archives/opa/press-release/file/923201/download.


    Study 1B. Stark Individual Differences: Face Recognition Ability Influences the

    Relationship Between Confidence and Accuracy in a Recognition Test of Game of Thrones

    Actors (Grabman & Dodson, submitted)

    Most people have experienced the embarrassment of greeting a stranger as if they were a

    recent acquaintance. Whether we risk this social faux pas depends on our certainty that we

    previously encountered this individual. In higher stakes contexts, eyewitness confidence has

    profound effects on the criminal justice system. Juror decisions are strongly influenced by

    confidence (Brewer & Burke, 2002), and judges are instructed to use certainty as an indicator of

    whether to admit the witness’s testimony in court (Neil vs. Biggers, 1972). The question is how

    probative confidence is of face recognition accuracy.

    In an influential review of the eyewitness literature, Wixted and Wells (2017) found that

    high confidence identifications are generally accurate. This relationship holds over changes in

    retention interval (i.e., the amount of time between study and test) (see Wixted, Read, et al., 2016

    for a review), exposure duration (i.e., the amount of time a face is viewed at encoding) (e.g.,

    Palmer, Brewer, Weber, & Nagesh, 2013), and a variety of other manipulations (see Wixted &

    Wells, 2017 for a review). However, there is a compelling need for studies of the confidence-

accuracy relationship that capture the richness of the real-world face-viewing experience.

    The fact that the average person can recognize thousands of unique faces (Jenkins,

    Dowsett, & Burton, 2018) masks aspects of this task that are remarkably complex. Faces are

    encountered in a myriad of contexts, often with considerable changes in lighting, orientation, and

    other characteristics (e.g., hair, age, clothing, etc.). While the majority of people can easily

    recognize family members and friends in a variety of situations, this task is far more challenging

    for unfamiliar faces (Kramer, Young, & Burton, 2018). As some examples of this difficulty,


a growing literature suggests that minimal disguises (such as sunglasses) can impair face

    recognition accuracy (Mansour, Beaudry, & Lindsay, 2017; Nguyen & Pezdek, 2017; Righi,

    Peissig, & Tarr, 2012; Terry, 1994). Moreover, studies in the face matching literature (i.e.,

    indicating whether two simultaneously presented faces are the same person or different people),

    show that subtle changes in viewing conditions (e.g., photos of the same person taken with

    different cameras) can substantially decrease matching decision accuracy (see Young & Burton,

    2017 for a review).

    Given the complexity of real-world face recognition, claims about the value of high

    confidence are complicated by multiple factors. First, participants in past studies generally knew

    that they were in an experiment, which potentially alters their face encoding strategies. Second,

exposure durations are typically shorter than those experienced in everyday life (e.g., 90 seconds), and retention-intervals are rarely longer than a few weeks (though see Read, Lindsay, & Nicholls, 1998

    for an exception). Third, most studies use single-trial designs, which limits conclusions to the

    small group of people presented. Finally, there is typically a single context for encoding faces,

    whereas in practice we must learn to recognize people (often encountered more than once) in

    varied environments.

    Additionally, a largely ignored aspect of the confidence-accuracy relationship in the

    eyewitness literature is heterogeneity in unfamiliar face recognition ability (Duchaine &

    Nakayama, 2006). Skill in this domain ranges from people with developmental prosopagnosia

    (i.e., face blindness), who may have difficulties recognizing even close family members (J. J. S.

    Barton & Corrow, 2016), to super-recognizers who are actively recruited to police departments

    for their face-recognition prowess (Ramon, Bobak, & White, 2019; Russell, Duchaine, &

    Nakayama, 2009). These differences are highly heritable (Shakeshaft & Plomin, 2015; Wilmer et


    al., 2010; Zhu et al., 2010), and only weakly associated with general intelligence (Gignac,

    Shankaralingam, Walker, & Kilpatrick, 2016; Shakeshaft & Plomin, 2015; Wilhelm et al., 2010;

    Zhu et al., 2010).

    Multiple studies show that higher face recognition ability predicts increased accuracy in

    eyewitness identification tasks (Andersen, Carlson, Carlson, & Gronlund, 2014; Bindemann,

Avetisyan, & Rakow, 2012; Morgan et al., 2007). However, only our group has investigated whether

    this skill influences the probative value of confidence in face recognition tasks. In contrast to

    previous research documenting a robust confidence-accuracy relationship across a wide range of

    manipulations, we found that weaker face recognizers are far more likely to make high

    confidence errors than are stronger recognizers (Grabman, Dobolyi, Berelovich, & Dodson,

    2019).

However, several aspects of Grabman et al. (2019) limit its real-world applicability. Participants viewed static images of faces at encoding and test, which fails to capture the

    experience of encountering moving people in varied contexts. Moreover, the study used

    relatively short exposure durations (3 repetitions of 3-seconds) and retention-intervals (up to 1

    day). It is possible that the impact of face recognition ability on the confidence-accuracy

    relationship is minimal with longer exposures or delays. Finally, the stimulus set consisted solely

    of young adult males, which further limits generalizability.

    Given the paucity of studies of the confidence-accuracy relationship under real-world

    viewing conditions, there are two aims for the current study. The first aim is to determine if the

    results from a more naturalistic setting mirror those of the carefully designed experiments cited

    in Wixted and Wells (2017). The second aim is to assess whether differences in face recognition


    ability influence the confidence-accuracy relationship using a design that addresses each of the

    short-comings of our previous study (Grabman et al., 2019).

    To accomplish these aims, we leveraged a dataset published by Devue, Wride, and

    Grimshaw (2019), accessed using the Open Science Framework (OSF) (https://osf.io/wg8vx). In

    this study, participants viewed the first six seasons of the popular television show Game of

    Thrones (GoT) as the series aired, then completed a recognition task of 90 pictures of actors (not

    in character) intermixed with 90 strangers. Importantly, participants viewed the show for

    personal entertainment, meaning that all faces are incidentally encoded. Moreover, as Devue et

    al. (2019) note, there are several additional aspects of GoT that make it an appealing way to

    study real-world face recognition. Characters are seen in a variety of natural viewing contexts,

    with often substantial changes in appearance, lighting, clothing, age, and viewpoint.

    Additionally, screen-time is readily accessible from internet databases, allowing for assessment

    of exposure duration effects. There are many character deaths throughout the series, resulting in

    lengthy retention intervals between encoding and test for some actors. Finally, there are over 600

    actors listed in the show credits, which provides a substantial face corpus from which to prepare

    stimuli.

    From the standpoint of the current study aims, this dataset offers some additional

    advantages. Each participant completed a standard test of face-recognition, the Cambridge Face

    Memory Test+ (CFMT+), and provided confidence ratings for each decision. While the original

    authors examined associations between these variables and accuracy using correlational analysis,

we use calibration curves, which are better suited to assessing the correspondence between confidence and accuracy

    (Wixted & Wells, 2017). And, for the first time, we analyze the conjunctive effects of confidence

    and face recognition ability on accuracy under real-world viewing conditions.


    Additionally, whereas eyewitness studies typically use a criminal lineup paradigm,

    participants in Devue et al (2019) completed an old-new recognition task. As far as we are aware,

    only one other study has used calibration curves to examine the confidence-accuracy relationship

    in an old-new face recognition paradigm for a large set of items (> 100 trials) (Tekin & Roediger,

    2017). These researchers used a single exposure duration (2-seconds) and a short retention-

    interval (10 min), and found highest confidence identifications to be about 96% accurate. It is an

    open question whether this impressive accuracy generalizes to uncontrolled settings with longer

    retention-intervals and differing levels of exposure.

    Finally, the use of another group’s dataset carries the benefit of reducing ‘researcher

    degrees of freedom’. If stronger face recognizers continue to make fewer high confidence errors

than weaker recognizers in an uncontrolled, naturalistic context, then this bolsters claims that

    there are robust associations between face recognition ability, confidence, and accuracy.

    Methods

    Participants.

Characteristics of the participants are reported in Devue et al. (2019). Briefly, the sample comprised 32 participants (20 women and 12 men), aged between 19 and 56 years

    (M = 28.7 years ± 10.5), who completed the task 3-6 months after the end of the sixth season of

    GoT. All participants watched six seasons of GoT once, and in order as the show aired, with the

    exception of some who viewed both Seasons 1 and 2 during the same year. While the sample size

    is low, the large number of trials per participant (n = 168) fits with current recommendations for

    the logistic mixed effects analysis outlined in the Results section (e.g., McNeish & Stapleton,

    2016).


    Materials.

    Cambridge Face Memory Test + (CFMT+). The CFMT+ is a frequently used test that

    assesses poor to superior face recognition ability (Russell et al., 2009). Participants memorize six

    male faces in three separate orientations. For each trial, previously viewed faces must be selected

    from an array of the target face and two foils. The test phase proceeds across 102 trials in five

    increasingly difficult blocks. Difficulty is manipulated with the use of novel images, visual noise

    filters, different levels of cropping, and (eventually) the use of a profile view with extra levels of

noise. Scores can range from 0 to 102 correct responses; because each trial is a three-alternative forced choice, a score of about 34 (one third of 102) corresponds to chance-level guessing.

    Face Stimuli. Extensive details about the generation of the study materials are provided in

Devue et al. (2019), with the materials themselves available on the OSF platform (https://osf.io/wg8vx). The researchers selected 84 actors from GoT across 15 conditions, formed by crossing retention-interval since last viewing (Season 6, 5, 4, 3, 1/2) with three levels of exposure: 'lead characters' [20 – 90 min screen time], 'support characters' [9 – 19 min], and 'bit parts' [< 9 min]. In addition, the six 'main heroes' [> 123 min screen time] survived to the end of the sixth season, with these actors serving as training trials for the task. Ninety pictures of unfamiliar faces were

    collected to serve as foils (i.e., ‘new’ trials), and “matched the actor set in terms of head

    orientation, age range, facial expression, attractiveness, presence of make-up, facial hair, or

    glasses, hairstyle, clothing style, lighting, and picture quality” (Devue et al., 2019). While foils

    matched the characteristics of the sample of actors as a whole, they were not individually paired

    to specific actors.


    In a similarity manipulation, half of the participants viewed photos of the actors which

    were similar to their last appearance on the show (similar), while the other half viewed photos

    that were as different as possible (dissimilar). These similarity groups were matched on CFMT+

    scores, age, and gender. Due to the scarcity of available photos for ‘bit part’ actors, all

    participants responded to both similar (17 trials) and dissimilar (13 trials) pictures for this

    exposure level, regardless of their assigned similarity condition.

    Procedure.

Full details of the procedure are outlined in Devue et al. (2019), so we mention only

    those pertinent to the present study. Participants completed all tasks on a computer. Following

    the CFMT+, participants were assigned to a similarity condition, and then started the GoT face

recognition task. An easy block consisting of the six 'main heroes' and six foils served as practice for the task, and was followed by 168 test trials consisting of 84 actors intermixed with 84 foils.

    Each trial started with a fixation cross (500 ms), followed by a picture stimulus that remained in

    the center of the screen until the participant’s response or up to 3,000 ms. Participants pressed

    the ‘K’ key to indicate they had ‘seen’ the face before (in GoT or elsewhere), or pressed ‘L’ to

    indicate that the face was ‘new’. They then provided a confidence rating for this decision using a

    5-point scale (1 = not at all confident, 5 = totally confident).

    Results

    Data preparation.

    Following the lead of the original authors, we discarded 26 trials where participants

    indicated they recognized an actor from outside of GoT, as well as the training trials (6 ‘main

    heroes’ + 6 foils per participant). One trial was omitted due to a typo (i.e., score of ‘2’ on

    accuracy, when only 0 and 1 were possible). We also removed all trials where participants


responded in < 300 ms (n = 371; 6.9% of total trials), as this is faster than established estimates of the time needed to process face identity plus the additional time required to execute a keystroke

    (e.g., Gosling & Eimer, 2011). In total, this left 4,979 responses from 32 participants. We have

    uploaded the data file used for the analysis to the OSF platform, along with a cleaned version of

the original Devue et al. (2019) file that is better suited to coding environments (e.g., R,

    Python) (https://osf.io/quhsg).
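As a rough illustration, the trial-exclusion steps above could be scripted along the following lines. This is only a sketch: the file name and column names (participant, block, recognized_outside_got, accuracy, response_time_ms) are hypothetical placeholders and do not necessarily match the variable names in the OSF file.

```python
import pandas as pd

# Load the trial-level data (file and column names are hypothetical placeholders).
trials = pd.read_csv("got_recognition_trials.csv")

# Drop the training trials (the six 'main heroes' plus six practice foils).
trials = trials[trials["block"] != "training"]

# Drop trials where the participant recognized the actor from outside of GoT.
trials = trials[~trials["recognized_outside_got"].astype(bool)]

# Drop the single mis-keyed trial (accuracy must be 0 or 1).
trials = trials[trials["accuracy"].isin([0, 1])]

# Drop anticipatory responses faster than 300 ms.
trials = trials[trials["response_time_ms"] >= 300]

print(len(trials), "responses from", trials["participant"].nunique(), "participants")
```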

    Table 1B.1 shows the breakdown of the frequency of responses into Hits (“Seen”|Actor),

    Misses (“New”|Actor), Correct Rejections (CR; “New”|Foil), and False Alarms (FA;

    “Seen”|Foil) by confidence level and a median split of CFMT+ performance, which we

    categorize as Weaker Face Recognizers (CFMT+ scores of 52-73) and Stronger Face

    Recognizers (CFMT+ scores of 74-90). Due to low frequencies of responses in confidence

    categories 1 and 2, we collapsed these levels to form a single confidence level (‘1-2’).

CFMT+ Group                          Confidence   Hit   Miss   FA    CR
Weaker Face Recognizers [52,73]      1-2           77    142    81   193
                                     3            196    257   149   348
                                     4            174    212    75   384
                                     5            236    117    28   141
Stronger Face Recognizers [74,90]    1-2           44     96    25   112
                                     3            104    189    52   290
                                     4            103    183    28   349
                                     5            222    131     4   213

    Table 1B.1. Frequency of responses of Hits (Seen|Actor), Misses (New|Actor), Correct

    Rejections (CR; New|Unfamiliar), and False Alarms (FA; Seen|Unfamiliar) categorized by

    confidence level and CFMT+ Median split.
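For readers working from the cleaned file on OSF, the grouping used in Table 1B.1 could be reproduced along the following lines. The file name and the columns cfmt_plus, confidence, and response_type are hypothetical placeholders; only the cut points follow the median split reported in the table.

```python
import pandas as pd

trials = pd.read_csv("got_recognition_trials_cleaned.csv")  # hypothetical file name

# Median split on CFMT+ scores: 52-73 = Weaker, 74-90 = Stronger face recognizers.
trials["ability_group"] = pd.cut(
    trials["cfmt_plus"], bins=[0, 73, 102], labels=["Weaker", "Stronger"]
)

# Collapse the sparsely used confidence levels 1 and 2 into a single '1-2' bin.
trials["confidence_bin"] = (
    trials["confidence"].astype(str).replace({"1": "1-2", "2": "1-2"})
)

# Frequency table of response types (Hit, Miss, FA, CR) by group and confidence,
# analogous to Table 1B.1.
print(pd.crosstab([trials["ability_group"], trials["confidence_bin"]],
                  trials["response_type"]))
```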


    Tables 1B.2 and 1B.3 show the frequencies of hits, misses, correct rejections, and false

    alarms across CFMT+ median split for the exposure duration and retention-interval

    manipulations, respectively. Due to the single-block design, the same foil counts (i.e., false

    alarms and correct rejections) are present in all levels of these within-subjects manipulations. To

    obtain an adequate trial count for the retention-interval contrasts (especially at the upper-end of

    the confidence scale), we recoded this variable into ‘Long Delay’ (Seasons 1-3; 34 actors),

    ‘Medium Delay’ (Seasons 4-5; 32 actors), and ‘Short Delay’ (Season 6; 18 actors) conditions,

    based on comparable discriminability within these time periods. The exposure duration contrast

    is composed of ‘leading actors’ (longest exposure; 27 actors), ‘supporting actors’ (medium

    exposure; 27 actors), and ‘bit parts’ (shortest exposure; 30 actors).

    Finally, Table 1B.4 shows the counts for the between-subjects similarity manipulation.

    We removed ‘bit part’ actors who did not match the condition assigned to the participant (e.g.,

    dissimilar ‘bit part’ photos in the similar condition). Note that removing the ‘bit part’ actors

    causes a slight difference in the total actor counts (i.e., hits + misses) for the similarity

    manipulation as compared to the total count for the full sample and the other manipulations.


CFMT+ Group                          Confidence   Exposure      Hit   Miss   FA    CR
Weaker Face Recognizers [52,73]      1-2          'Bit Parts'    28     61    81   193
                                                  'Supports'     25     51
                                                  'Leads'        24     30
                                     3            'Bit Parts'    73    138   149   348
                                                  'Supports'     63     76
                                                  'Leads'        60     43
                                     4            'Bit Parts'    26    115    75   384
                                                  'Supports'     75     60
                                                  'Leads'        73     37
                                     5            'Bit Parts'    13     53    28   141
                                                  'Supports'     62     37
                                                  'Leads'       161     27
Stronger Face Recognizers [74,90]    1-2          'Bit Parts'    15     41    25   112
                                                  'Supports'     21     31
                                                  'Leads'         8     24
                                     3            'Bit Parts'    34     97    52   290
                                                  'Supports'     38     62
                                                  'Leads'        32     30
                                     4            'Bit Parts'    19    104    28   349
                                                  'Supports'     43     49
                                                  'Leads'        41     30
                                     5            'Bit Parts'     0     73     4   213
                                                  'Supports'     58     33
                                                  'Leads'       164     25

    Table 1B.2. Frequency of Hits, Misses, Correct Rejections (CR), and False Alarms (FA),

    categorized by short (‘bit parts’), medium (‘supports’) and long (‘leads’) exposures, as well as

    CFMT+ Median split.


CFMT+ Group                          Confidence   Delay     Hit   Miss   FA    CR
Weaker Face Recognizers [52,73]      1-2          Long       33     71    81   193
                                                  Medium     29     44
                                                  Short      15     27
                                     3            Long       77    122   149   348
                                                  Medium     89     91
                                                  Short      30     44
                                     4            Long       55     98    75   384
                                                  Medium     74     70
                                                  Short      45     44
                                     5            Long       72     37    28   141
                                                  Medium     86     56
                                                  Short      78     24
Stronger Face Recognizers [74,90]    1-2          Long       18     47    25   112
                                                  Medium     19     32
                                                  Short       7     17
                                     3            Long       43     77    52   290
                                                  Medium     45     79
                                                  Short      16     33
                                     4            Long       36     74    28   349
                                                  Medium     32     78
                                                  Short      35     31
                                     5            Long       68     59     4   213
                                                  Medium     85     44
                                                  Short      69     28

    Table 1B.3. Frequency of Hits, Misses, Correct Rejections (CR), and False Alarms (FA)

categorized by long (Seasons 1-3), medium (Seasons 4-5), and short (Season 6) retention-

    intervals, as well as CFMT+ Median split.


Similarity    CFMT+ Group                          Confidence   Hit   Miss   FA    CR
Similar       Weaker Face Recognizers [52,73]      1-2           28     62    27    92
                                                   3             54    102    58   175
                                                   4             57     87    36   181
                                                   5             96     73    18   122
              Stronger Face Recognizers [74,90]    1-2           23     39    16    54
                                                   3             41     82    21   144
                                                   4             24     85     5   162
                                                   5             62     64     3   122
Dissimilar    Weaker Face Recognizers [52,73]      1-2           38     48    54   101
                                                   3            105     89    91   173
                                                   4            106     61    39   203
                                                   5            136     12    10    19
              Stronger Face Recognizers [74,90]    1-2           16     32     9    58
                                                   3             51     62    31   146
                                                   4             72     45    23   187
                                                   5            160     30     1    91

    Table 1B.4. Frequency of Hits, Misses, Correct Rejections (CR), and False Alarms (FA)

categorized by whether actors looked similar to their last appearance on the show (‘similar’) or

    as dissimilar as possible (‘dissimilar’), as well as CFMT+ Median split. Note that trial counts do

    not match Table 1B.1 because of the removal of ‘bit part’ actors who did not match the condition

    assigned to the participant (e.g., dissimilar ‘bit part’ photos in the similar condition).


    Is there a strong relationship between confidence and accuracy in a real-world viewing context?

Devue et al. (2019) analyzed the relationship between confidence and overall accuracy

    using Pearson’s correlation coefficients. This analysis found minimal associations between

    overall accuracy (centered and scaled) and average confidence on accurate trials (r = .125), as

well as average confidence on inaccurate trials (r = -.096).

    One issue with defining the confidence-accuracy relationship in terms of overall accuracy

    is that research generally shows a stronger correspondence between confidence and accuracy for

    identifications (i.e., ‘seen’ responses) than non-identifications (i.e., ‘new’ responses) (e.g.,

    Brewer & Wells, 2006). Separating these response types may reveal more robust relationships

    than previously reported. Additionally, correlation analysis addresses a fundamentally different

    question than is typically of interest to applied memory researchers (Juslin, Olsson, & Winman,

    1996). Whereas correlation coefficients measure covariation, or the tendency for one variable to

    increase/decrease as another variable increases/decreases, applied researchers are generally more

    interested in the accuracy of responses made with a particular level of confidence.

    As a concrete example of this difference, imagine that a participant provides the highest

possible confidence rating to every trial. The correlation between confidence and accuracy is then zero (strictly speaking, undefined, because confidence has no variance), regardless of whether accuracy increases or decreases. However, despite this absence of correlation, the participant would be perfectly calibrated if

    they were correct on every trial. Given that the participant used the highest possible confidence

    rating, we observed their response to be correct 100% of the time.

    An easy way to visualize the probative value of confidence is with a calibration curve

    (Tekin & Roediger, 2017; see also Mickes, 2015). Along the X-axis are progressively increasing

    confidence values. On the Y-axis is a proportion representing the number of correct items over


    the sum total of items at this level of confidence (i.e., correct / (correct + incorrect)). Points are

    plotted representing Y-accuracy at X-confidence level. The slope of the lines connecting the

    points provides additional information. Upward sloping lines signal increasing accuracy with

    higher levels of confidence, whereas flat lines indicate little difference in predictive power

    between two confidence ratings.
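As a concrete sketch, the identification calibration points described above can be computed directly from the counts in Table 1B.1, pooled across ability groups. The plotting call is purely illustrative and omits the bootstrapped error bars shown in Figure 1B.1 below; pandas and matplotlib are assumed to be available.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hits and false alarms by confidence bin, pooled across ability groups (Table 1B.1).
counts = pd.DataFrame({
    "confidence":   ["1-2", "3", "4", "5"],
    "hits":         [77 + 44, 196 + 104, 174 + 103, 236 + 222],
    "false_alarms": [81 + 25, 149 + 52, 75 + 28, 28 + 4],
})

# Identification accuracy at each confidence level: hits / (hits + false alarms).
counts["id_accuracy"] = counts["hits"] / (counts["hits"] + counts["false_alarms"])

plt.plot(counts["confidence"], counts["id_accuracy"], marker="o")
plt.axhline(0.5, linestyle="--")  # chance accuracy
plt.ylim(0, 1)
plt.xlabel("Confidence rating")
plt.ylabel("Proportion correct (identifications)")
plt.show()
```

These pooled counts recover, to rounding, the full-sample identification accuracies reported below (roughly 53% at the lowest confidence bin and roughly 94% at the highest).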

    Figure 1B.1 shows the calibration curves for all identification (‘seen’) (hits/[fa + hits])

    and non-identification (‘new’) (cr/[cr + misses]) responses in the GoT task, collapsed across

    participants. Replicating the eyewitness research, there is clearly a strong positive relationship

    between higher confidence responses and identification accuracy. The highest confidence level

    (‘5’) boasts accuracy rates of 93.5% (95% HDI1, [89.8, 97.0]), as compared to 53.3% (95% HDI,

    [46.3, 61.3]) at the lowest level (‘1-2’). However, as indicated by the flat line in the right panel,

    there is little association between confidence and accuracy for non-identifications.

    Figure 1B.1. Calibration curves for the full sample of responses. Notably, there is a strong

    relationship between confidence and accuracy for identifications (left panel), but weaker

    associations for non-identifications (right panel). The dashed lines at 50% reflect chance

    accuracy. Error bars reflect 95% HDIs.

    1 Highest Density Intervals (HDI) are presented for consistency with later analyses. These

    intervals are based on 10,000 bootstrapped resamples and reflect 95% of values where the probability

    density is greater than points outside these bounds.
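The interval construction described in the footnote could be approximated as follows for a single cell of the calibration curve. This trial-level resampling is only a sketch: the exact resampling scheme used for the reported intervals (e.g., whether trials or participants were resampled) is not specified here, so the bounds produced by this code need not match the figures reported above.

```python
import numpy as np

rng = np.random.default_rng(1)

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the bootstrap distribution."""
    s = np.sort(np.asarray(samples))
    n_in = int(np.ceil(mass * len(s)))
    widths = s[n_in - 1:] - s[: len(s) - n_in + 1]
    start = int(np.argmin(widths))
    return s[start], s[start + n_in - 1]

# Example cell: highest-confidence identifications pooled across groups
# (458 hits and 32 false alarms in Table 1B.1; observed accuracy of about 93.5%).
outcomes = np.concatenate([np.ones(458), np.zeros(32)])

# 10,000 bootstrap resamples of the trial outcomes, as in the footnote.
boot_acc = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
            for _ in range(10_000)]

print(hdi(boot_acc))
```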


    Next, we examined the impact of exposure duration (‘leads’ vs. ‘supports’ vs. ‘bit parts’;

    within-subjects), retention-interval (‘long’ [S1-3] vs. ‘medium’ [S4-5] vs. ‘short’ [S6]; within-

subjects), and similarity (‘similar’ vs. ‘dissimilar’; between-subjects) on the predictive value of

    confidence ratings. We analyzed each of these manipulations separately (i.e., main effects), as

    there are too few data-points per cell to assess interactions.

    Because foils are not matched to specific actors in this single-block design, the same false

    alarms and correct rejections must be used in (non-)identification accuracy calculations for each

    condition. However, before computing accuracy scores, we needed to account for the unequal

    numbers of actor trials across conditions. Without an adjustment, the same hit/false alarm rates

    (at a given level of confidence) can produce different calibration curves.

    For example, imagine that participants respond ‘seen’ to 50% of actor trials and 25% of

    foil trials with a given level of confidence for both short (18 actors) and medium (32 actors)

    retention-intervals (i.e., hit rate = 50%, false alarm rate = 25% at this level of confidence).

    Multiplying out (and assuming no data eliminations), this gives 18 actors * .50 hit rate * 32

    participants = 288 hits vs. 32 actors * .50 hit rate * 32 participants = 512 hits for the short and

    medium conditions, respectively. Naively, these trials would be compared against 84 foils * .25

    false alarm rate * 32 participants = 672 false alarms for both groups. Using the formula for

identification accuracy [hits / (hits + fa)], we would find accuracy rates of 288 hits / (288 hits + 672 fa) = 30% and 512 hits / (512 hits + 672 fa) ≈ 43% for the short and medium retention-intervals, respectively. In other words, despite identical use of the confidence scale across conditions, a difference of roughly 13 percentage points emerges purely because of disparities in the number of actor trials. Moreover, both conditions’ values are far from the nominal identification accuracy rate expected with a study design implementing equal numbers of actor and foil trials, or .50 / (.50 + .25) ≈ 67%.
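The arithmetic in this hypothetical example can be checked in a few lines; the values below simply restate the assumed hit and false-alarm rates from the text and are not taken from the actual data.

```python
participants, n_foils = 32, 84
hit_rate, fa_rate = 0.50, 0.25

def naive_id_accuracy(n_actors):
    # Hypothetical counts at one confidence level, assuming no trial eliminations.
    hits = n_actors * hit_rate * participants
    false_alarms = n_foils * fa_rate * participants
    return hits / (hits + false_alarms)

print(f"short delay (18 actors):    {naive_id_accuracy(18):.0%}")              # 288 / 960 = 30%
print(f"medium delay (32 actors):   {naive_id_accuracy(32):.0%}")              # 512 / 1184 ≈ 43%
print(f"nominal rate, equal trials: {hit_rate / (hit_rate + fa_rate):.0%}")    # ≈ 67%
```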


    To ensure comparability between conditions, we adjusted the frequency of f