Richard Wollert, Ph.D. Diane Lytton, Ph.D. Jacqueline Waggoner, Ed.D. Marc Goulet, Ph.D

Projection Rights Reserved.

2005 ATSA Convention, Nov 16-19, Salt Lake City

1

Competent Use Of Actuarials Requires Understanding Sample-Wise Variations In Both Recidivism And Test Accuracy

Richard Wollert, Ph.D.

Diane Lytton, Ph.D.

Jacqueline Waggoner, Ed.D.

Marc Goulet, Ph.D.

Available at http://richardwollert.com



2

In A 2004 Article In Sexual Abuse, Doren Compared The 5-Year Score-Wise Recidivism Rates For The Construction Samples Of The RRASOR And Static-99 With A Number Of Generalization Samples.

Notes on Abbreviations

Score-wise Recidivism = SWR = Rate for a given point total

Construction Sample = CS = Developmental Sample

Generalization Sample = GS = A Comparison Sample



3

The Purpose Of These Comparisons

• “To discover the degree to which the risk percentages for each instrument score replicate across different samples and different underlying base rates” (p. 27).



4

As A First Step In This Study, Many Data Sets Were Obtained From Different Sources

• The data sets, or generalization samples, reported the number of recidivists and non-recidivists at each test score.



5

Two Procedures Were Used To Combine The Data From the GSs

• 1. Recidivism data for all GSs were pooled into a single “mega-sample” that was stratified by test scores.
– Samples with BRs below that of the CS were not differentiated from samples with higher BRs.

• 2. The data from the GSs were combined to form 8 “semi-overlapping groups” that varied in their overall recidivism rates (from about 6% to 29%).
– These were also stratified by test scores.



6

The Table Below Shows How GSs Were Combined To Form 8 “Semi-Overlapping” Groups

Contents of Overlapping Groups For RRASOR And Static-99 Analysis

Source 1=6% 2=7% 3=10% 4=14% 5=17% 6=21% 7=26% 8=29%

T X X X

S X X X X

N X X X X X

B X X X X X

K X X X X X

M X X X X X

L X X X X X

D X X X X X X

R X X X X X

V X X X X



7

The Data Were Analyzed Using Two Chi-Square Designs

• 1. The recidivism rate for each test score in the CS was compared with the rate for the corresponding score that was derived from the mega-sample.

• 2. The recidivism rate for each test score in the CS was compared with the rate for the corresponding score from each of the overlapping samples.



8

Here Is A Format For Organizing RRASOR Data In The First Analysis. Two “Summary” Experience Tables Are Contrasted At Each Of Six Levels.

Group →           CS     ALL GSs
RRASOR Score ↓
0                 .044   .038
1                 .076   .093
2                 .142   .115
3                 .248   .277
4                 .327   .338
5                 .498   .472

Note: The rate for ALL GSs when RRASOR = 0 is .038.



9

Here Is A Format For Organizing RRASOR Data For The Second Analysis. Two Summary Experience Tables Were Again Contrasted.

Group →           CS     OG-3
RRASOR Score ↓
0                 .044   .038
1                 .076   .092
2                 .142   .115
3                 .248   .260
4                 .327   .302
5                 .498   .447

Note: The rate for OG-3 when RRASOR = 0 is .038.



10

Seven Other Sets Of Tables Like The One Above Were Also Part Of The Second Analysis Because Data Were Combined To Make 8 Groups. Here Is The Last One.

Group →           CS     OG-8
RRASOR Score ↓
0                 .044   .162
1                 .076   .236
2                 .142   .190
3                 .248   .383
4                 .327   .416
5                 .498   .548

Note: The rate for OG-8 when RRASOR = 0 is .162.



11

Findings

• One significant difference in 13 tests was found when the SWR rates from the mega-sample were compared against those from the CS.

• Relatively few significant differences were observed when the recidivism rates for the overlapping groups with overall base rates ranging from 9% to 21% were compared against those from the CS.



12

A Number of Claims Were Based On These Patterns Of Non-significant Findings

• Every 5-year SWR rate from each of the CSs was “replicated” (p. 33) in the GSs.

• The SWR rates “remained essentially unchanged … through a range of plus or minus 6% around a center point” (p. 33).
– For the RRASOR the center point was 13%.
– For Static-99 it was 15%.



13

Some Guidelines For Evaluators Who Administer The RRASOR and Static-99 Were Also Formulated On The Basis Of These Findings

• When using the RRASOR, they can always assign the SWR rates for the CS because no meaningful differences in SWR rates were found between the CS and groups with differing BRs.

• With Static-99 they should determine if an offender is from a parent population with a very high or low BR (because some differences were found in these regions).
– It was recommended that the rate for the CS be assigned where the BR for the parent population ranges from 9-21%.



14

The Author Also Claimed His Results Provided Empirical Evidence That SWR Rates Don’t Always Fluctuate When The BR For One Sample Differs From Another

• In particular, he stated that “although it may have been believed that a sample’s underlying base rate could effect (sic) the interpretation of the actuarial instruments’ scores, that belief was found largely not supported (in my analysis) … the argument has become significantly weaker that an unknown sample recidivism base rate affects the interpretability of actuarial scores” (p. 34, Stability of the Interpretive Risk Percentages for the RRASOR and Static-99, 2004, Sexual Abuse, 16, 25-36).



15

Some Evaluators May Be Tempted To Justify SVP Civil Commitment Recommendations On The Basis Of This Article. Here Is One Possible Train Of Logic.

• Defendant Jones has a high RRASOR score.

• The BR for the parent population from which defendant Jones was drawn may be lower than that for the RRASOR CS.

• Doren has shown that SWR rates for high RRASOR scores are the same even when the BR for one sample is lower than another.

• The recidivism rate for high scorers in the CS is therefore applicable to Jones.



16

Experts Who Rely On This Argument Run The Risk Of Providing Information That Is Misleading

• Why? Because the research in question contains many methodological and conceptual flaws.

• We will discuss only one of these flaws today, but we believe it is so fundamental and devastating that it invalidates the findings, conclusions, and interpretations reported in the article of concern.



17

The Flaw Is This: The Original Research Question Was Posed Too Narrowly To Fully Address The Issue Of Replication

• From the stated purpose and the article’s context, it is apparent that replication was conceived of as simply the stability of recidivism rates for each score over different summary experience tables.



18

Score-Wise Recidivism Is Defined By A Math Formula, However. A Variation On Bayes’s Theorem, The Formula is E=PT/(PT+QF)

• P = The base rate for those with test scores that fall within a specified range of scores.
– The range could include all scores (Case “A”: scores 0-6+ on Static-99) or a subset of scores (Case “B”: scores of only 4-6+).

• Q = The non-recidivism rate, which is always 1-P.

• T = The true positive fraction: The % of recidivists with high scores in a specified range of scores.
– Case A: # recidivists with 6+ scores / # recidivists with 0-6+ scores
– Case B: # recidivists with 6+ scores / # recidivists with 4-6+ scores

• F = The false positive fraction: The % of non-recidivists with high scores in a range of scores.
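The formula E = PT/(PT+QF) can be written as a short function. This is a minimal sketch: the function name and the input values are hypothetical, chosen only to show how P, T, and F combine, and are not taken from any sample.

```python
def score_wise_rate(p: float, t: float, f: float) -> float:
    """Return E = PT / (PT + QF), where Q = 1 - P."""
    if not (0.0 <= p <= 1.0 and 0.0 <= t <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("P, T, and F must each lie between 0 and 1")
    q = 1.0 - p  # non-recidivism rate
    denominator = p * t + q * f  # share of the range that has a high score
    if denominator == 0.0:
        raise ValueError("E is undefined when nobody in the range has a high score")
    return (p * t) / denominator


# Hypothetical inputs for illustration only (not from any sample).
example_e = score_wise_rate(p=0.20, t=0.30, f=0.10)
print(f"E = {example_e:.3f}")
```

The denominator PT + QF is the proportion of everyone in the range who has a high score, so E is the share of high scorers who recidivated.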



19

Using Bayes’s Theorem To Calculate E For Case A (0-6+): E = (.180 x .256) / ((.180 x .256) + (.82 x .089)) = .39

Predictions & Decision Rules   Recidivated        Didn’t Recidivate    Efficiency & Base Rate
Will Reoffend (C: 6-9)         50                 79                   E (efficiency): 50/129 = .39
Won’t Reoffend (L: 0-5)        145                812
Sum                            50 + 145 = 195     79 + 812 = 891       All = 195 + 891 = 1086

T (sensitivity): 50/195 = .256
F (1-specificity): 79/891 = .089
P (base rate): 195/1086 = .180
Q = 1 – P = .82



20

Using Bayes’s Theorem To Calculate E For Case B (4-6+): E = (.315 x .371) / ((.315 x .371) + (.685 x .286)) = .374

Predictions & Decision Rules   Recidivated        Didn’t Recidivate    Efficiency & Base Rate
Will Reoffend (C: 6+)          49                 82                   E (efficiency): 49/131 = .374
Won’t Reoffend (L: 4-5)        34 + 49 = 83       65 + 140 = 205
Sum                            49 + 83 = 132      82 + 205 = 287       All = 132 + 287 = 419

T (sensitivity): 49/132 = .371
F (1-specificity): 82/287 = .286
P (base rate): 132/419 = .315
Q = 1 – P = .685
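The Case A and Case B arithmetic can be checked from the four cell counts on slides 19 and 20. The sketch below derives P, T, and F from those counts and confirms that the Bayes formula gives the same E as the direct ratio of true positives to all high scorers. The function and key names are illustrative only.

```python
def components_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive P, T, F, and E from a 2x2 table of predictions and outcomes."""
    recidivists = tp + fn
    non_recidivists = fp + tn
    total = recidivists + non_recidivists
    p = recidivists / total        # base rate
    t = tp / recidivists           # true positive fraction (sensitivity)
    f = fp / non_recidivists       # false positive fraction (1 - specificity)
    e_bayes = (p * t) / (p * t + (1.0 - p) * f)
    e_direct = tp / (tp + fp)      # recidivists among those predicted to reoffend
    return {"P": p, "T": t, "F": f, "E_bayes": e_bayes, "E_direct": e_direct}


# Cell counts as shown on slides 19 (Case A) and 20 (Case B).
case_a = components_from_counts(tp=50, fp=79, fn=145, tn=812)
case_b = components_from_counts(tp=49, fp=82, fn=83, tn=205)

for label, case in (("Case A", case_a), ("Case B", case_b)):
    print(label, {name: round(value, 3) for name, value in case.items()})
```

Narrowing the range from 0-6+ to 4-6+ raises P (from about .18 to about .32) and F (from about .09 to about .29), yet E barely moves. The same score-wise rate can therefore rest on quite different component values.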



21

Several Principles May Be Deduced From E = PT / (PT + QF)

• 1. Each score-wise rate reported in a summary experience table is a function of several variables (P, T, and F) that constitute an underlying (and rarely disseminated) “component” experience table.

• 2. Samples may have similar score-wise recidivism rates, but differ with respect to P, T, or F (see slide 22).

• 3. A score-wise recidivism rate is truly replicated only when the associated values of P, T, and F from different experience tables are replicated (also see slide 22).



22

Variations In P, T, and F May Be Found For Samples With Similar SWR Rates: An Example Using Static-99 Data (Note: E Is Obtained By Applying Bayes’s Theorem)

          Hanson & Thornton (2003)     Harris et al. (2003)
Score ↓   E     P     T     F          E     P     T     F
0
1
2
3
4         .26   .13   .44   .19        .28   .19   .25   .15
5
6         .39   .18   .26   .09        .37   .26   .30   .19
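The slide 22 values can be re-derived with a few lines of code. The sketch below recomputes E from each reported P, T, and F and adds the likelihood ratio T/F for comparison. Because the inputs are rounded to two decimals, the recomputed E may differ from the reported E in the second decimal place. The variable names are illustrative.

```python
# (source, score) -> reported values from the slide 22 table
reported = {
    ("Hanson & Thornton", 4): {"E": 0.26, "P": 0.13, "T": 0.44, "F": 0.19},
    ("Harris et al.", 4):     {"E": 0.28, "P": 0.19, "T": 0.25, "F": 0.15},
    ("Hanson & Thornton", 6): {"E": 0.39, "P": 0.18, "T": 0.26, "F": 0.09},
    ("Harris et al.", 6):     {"E": 0.37, "P": 0.26, "T": 0.30, "F": 0.19},
}

recomputed = {}
for key, row in reported.items():
    p, t, f = row["P"], row["T"], row["F"]
    e = (p * t) / (p * t + (1.0 - p) * f)   # E = PT / (PT + QF)
    recomputed[key] = {"E": e, "LR": t / f}  # LR is the likelihood ratio T/F

for (source, score), values in recomputed.items():
    print(f"{source}, score {score}: E = {values['E']:.3f}, T/F = {values['LR']:.2f}")
```

At each score the two sources report similar values of E, but the sample with the higher P has the lower T/F.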



23

Other Principles

• 4. If T and F are stable, the recidivism rate will change only if P changes.

• 5. If P and Q are stable, the score-wise recidivism rate will change as a function of changes in the “likelihood” ratio of T/F.
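Principles 4 and 5 follow from rewriting the formula in odds form: E/(1-E) = (P/Q) x (T/F). The sketch below illustrates both. All input values are hypothetical and were chosen only to make the direction of change visible.

```python
def rate_from_odds(p: float, likelihood_ratio: float) -> float:
    """E via the odds form: E / (1 - E) = (P / Q) * (T / F)."""
    posterior_odds = (p / (1.0 - p)) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Principle 4: hold T and F fixed (hypothetical T = .26, F = .09) and vary P.
fixed_lr = 0.26 / 0.09
by_base_rate = {p: rate_from_odds(p, fixed_lr) for p in (0.06, 0.18, 0.29)}

# Principle 5: hold P fixed (hypothetical P = .18) and vary the ratio T/F.
by_lr = {lr: rate_from_odds(0.18, lr) for lr in (1.5, 3.0, 8.0)}

for p, e in by_base_rate.items():
    print(f"P = {p:.2f}, T/F fixed -> E = {e:.3f}")
for lr, e in by_lr.items():
    print(f"P = 0.18, T/F = {lr:.1f} -> E = {e:.3f}")
```

In both loops E rises with the quantity being varied, so a stable E across samples with different P is possible only if T/F moves in the opposite direction.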



24

These Mathematical Facts Hold Important Implications For The Research Being Analyzed

• Recall that the author concluded that the score-wise recidivism rates for samples with different overall recidivism rates did not differ from one another.

• Assuming that acceptance of the null hypothesis is justified, this can mean only one thing.

– The likelihood (T/F) ratio changed from one sample to another.

– Mossman, who has published many articles on ROC analysis and Bayes’s theorem, made the same point about Doren’s research in an article that has been accepted for publication in Sexual Abuse.
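The size of the shift this implies can be estimated by solving the odds form of the formula for T/F. The sketch below is illustrative only: it takes the CS rate for a RRASOR score of 5 (.498, from slide 8) as the target E, and uses a few of the overall group rates listed on slide 6 as stand-ins for P. The function name is hypothetical.

```python
def required_likelihood_ratio(e: float, p: float) -> float:
    """T/F needed to produce score-wise rate E at base rate P."""
    return (e / (1.0 - e)) / (p / (1.0 - p))


target_e = 0.498                             # CS rate for RRASOR = 5 (slide 8)
base_rates = [0.06, 0.10, 0.14, 0.21, 0.29]  # illustrative stand-ins for P
required = [required_likelihood_ratio(target_e, p) for p in base_rates]

for p, lr in zip(base_rates, required):
    print(f"P = {p:.2f} -> T/F must be about {lr:.2f} to keep E at {target_e}")
```

Under these assumptions the required T/F falls from about 15.5 at P = .06 to about 2.4 at P = .29. Holding E constant while P rises therefore demands a steadily less accurate test.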



25

We Tested This Hypothesis After Obtaining The Frequency Data Analyzed In The Original Study

• Adopting 5 as a high score on the RRASOR, likelihood ratios were calculated for the construction sample and for all generalization samples where this was possible.
– It was impossible to define LRs for 3 of 10 samples.

• Adopting 6 as a high score, equivalent calculations were undertaken for the Static-99 construction sample and for all generalization samples.
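A likelihood ratio can be computed directly from frequency data once a high-score cutoff is fixed. The sketch below is not the authors' procedure. It assumes an LR is treated as undefined when a sample has no recidivists, no non-recidivists, or no non-recidivists at the high score (F = 0); the slides do not say why LRs could not be defined for some samples. The first example uses the slide 19 counts, and the second uses hypothetical counts.

```python
from typing import Optional


def likelihood_ratio(tp: int, fn: int, fp: int, tn: int) -> Optional[float]:
    """Return T/F, or None when the ratio cannot be defined (assumed rule)."""
    recidivists = tp + fn
    non_recidivists = fp + tn
    if recidivists == 0 or non_recidivists == 0 or fp == 0:
        return None  # T or F is undefined, or F = 0 makes T/F undefined
    t = tp / recidivists
    f = fp / non_recidivists
    return t / f


# Counts from slide 19 (Static-99, high score = 6 or more).
lr_slide_19 = likelihood_ratio(tp=50, fn=145, fp=79, tn=812)

# Hypothetical small sample in which no non-recidivist has a high score.
lr_undefined = likelihood_ratio(tp=3, fn=12, fp=0, tn=85)

print(lr_slide_19, lr_undefined)
```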



26

Other Steps Of The Re-analysis

• Upper and lower confidence limits (p = .05) were established for the LRs from the RRASOR and Static-99.

• The LRs for the generalization samples were plotted against the confidence intervals for the construction sample.
– Data for other scores were not analyzed because recidivism rates for lower scores were correlated with recidivism rates for maximum scores.
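The slides do not state how the confidence limits were computed. One common approach for a ratio of two proportions is the log method, which puts a normal interval around ln(T/F). The sketch below uses that method as an assumption, not as the authors' documented procedure, and applies it to the slide 19 counts. The function name is hypothetical.

```python
import math


def lr_with_limits(tp: int, n_recid: int, fp: int, n_nonrecid: int, z: float = 1.96):
    """Return (LR, lower, upper) using the log method for a ratio of proportions."""
    t = tp / n_recid
    f = fp / n_nonrecid
    lr = t / f
    # Standard error of ln(LR) for two independent binomial proportions.
    se_log = math.sqrt(1 / tp - 1 / n_recid + 1 / fp - 1 / n_nonrecid)
    lower = math.exp(math.log(lr) - z * se_log)
    upper = math.exp(math.log(lr) + z * se_log)
    return lr, lower, upper


# Slide 19 counts: 50 of 195 recidivists and 79 of 891 non-recidivists scored 6+.
lr, lower, upper = lr_with_limits(tp=50, n_recid=195, fp=79, n_nonrecid=891)
print(f"LR = {lr:.2f}, 95% limits = {lower:.2f} to {upper:.2f}")
```

On these counts the method gives limits of roughly 2.10 and 3.98, close to the lower and upper limits plotted for H on slide 28. That closeness is suggestive but does not establish which method was actually used.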



27

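All Likelihood Ratios From The RRASOR Generalization Samples Were Significantly Different From The Likelihood Ratio For The RRASOR Construction Sample

[Figure: likelihood ratios (vertical axis) plotted by sample against base rate (horizontal axis). Base-rate axis labels: .18, .06, .09, .15, .17, .19, .26, .39]

H (plotted with confidence limits): LR = 8.0; upper limit (UL) = 13.61; lower limit (LL) = 4.71
Above the upper limit: S = 35.0; M = 17.14
Below the lower limit: N = 4.50; B = 4.22; V = 2.57; R = 2.2; K = 1.8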



28

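The Likelihood Ratios In 6 Of 7 Generalization Samples Were Significantly Different From The Likelihood Ratio For The Static-99 Construction Sample

[Figure: likelihood ratios (vertical axis) plotted by sample against base rate (horizontal axis). Base-rate axis labels: .18, .06, .09, .15, .17, .19, .26, .39]

H (plotted with confidence limits): LR = 3.01; upper limit (UL) = 3.97; lower limit (LL) = 2.10
Above the upper limit: S = 4.29; N = 4.33; B = 4.16
Within the limits: M = 3.5
Below the lower limit: K = 1.89; R = 1.66; V = 1.54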
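The counts in the two slide titles can be checked against the plotted values. The sketch below takes the upper and lower limits shown for H on slides 27 and 28 as the construction-sample interval (an assumption based on the UL and LL labels) and lists the generalization samples whose LRs fall outside it. The variable and function names are illustrative.

```python
# (lower limit, upper limit) shown for H on slides 27 and 28
rrasor_limits = (4.71, 13.61)
static99_limits = (2.10, 3.97)

# Generalization-sample LRs as plotted on slides 27 and 28
rrasor_lrs = {"S": 35.0, "M": 17.14, "N": 4.50, "B": 4.22, "V": 2.57, "R": 2.2, "K": 1.8}
static99_lrs = {"S": 4.29, "N": 4.33, "B": 4.16, "M": 3.5, "K": 1.89, "R": 1.66, "V": 1.54}


def outside_limits(limits, sample_lrs):
    """Return the sorted names of samples whose LR lies outside the interval."""
    lower, upper = limits
    return sorted(name for name, lr in sample_lrs.items() if lr < lower or lr > upper)


rrasor_outside = outside_limits(rrasor_limits, rrasor_lrs)
static99_outside = outside_limits(static99_limits, static99_lrs)

print(f"RRASOR: {len(rrasor_outside)} of {len(rrasor_lrs)} outside: {rrasor_outside}")
print(f"Static-99: {len(static99_outside)} of {len(static99_lrs)} outside: {static99_outside}")
```

Read this way, all 7 RRASOR samples and 6 of the 7 Static-99 samples fall outside the interval, with M the only Static-99 sample inside it.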



29

Correlational Analyses Indicated That Test Accuracy Decreased As Base Rates Increased

• RRASOR LRs with sample-wise base rates: r = -.52 (n = 8; p = .17)

• Static-99 LRs with sample-wise base rates: r = -.86 (n = 8; p < .01)



30

Implications Of This Re-analysis For The Research Under Consideration

• Score-wise recidivism rates were not replicated in the criticized research because similarities in rates were an artifact of fluctuations in likelihood ratios.

• Characterizing the principle that score-wise recidivism rates vary with base rate differences as a “belief” is misleading. As long as F and T are stable, it is a mathematical fact.

• Proposing guidelines for evaluators to follow that conflict with Bayes’s theorem is potentially harmful because of the increase in prediction errors that this may occasion.



31

Practice Implications

• Variations in detection indicia and base rates raise doubts about the applicability of published SWR rates for RRASOR and Static-99 to local populations.
– Agencies that use these tests should consider re-norming them on local populations.

• The correlational analyses suggest that these tests are most inaccurate for populations that are of greatest concern because of their high recidivism rates.

• When using these tests, examiners should disclose their assumptions about P, T, and F, and present data that support their assumptions.



32

Research Implications

• Current data on representative and large samples would facilitate meaningful replication research.

• Test developers might improve accuracy by investigating factors that produce fluctuations in likelihood ratios.
– Why is accuracy so diminished in groups with high base rates?

• Component experience tables should be compiled to accompany summary tables. These tables should include frequency data for true positives, true negatives, false positives, and false negatives. Associated Bayesian values should also be included. Each table should describe subjects and sampling methods.



33

Component Experience Tables Should Be Compiled To Accompany Summary Experience Tables

Component Experience Table

          Summary Table   Bayesian Values     Frequency Data
Score ↓   E               P     T     F       TPs   FPs   FNs   TNs
0
1
2
3
4
5
