23
PSYCHOLOGICAL BULLETIN Vol. 52, No. 3, 1955 ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS, OR CUTTING SCORES PAUL E. MEEHL University of Minnesota AND ALBERT ROSEN VA Hospital, Minneapolis, and University of Minnesota 1 In clinical practice, psychologists frequently participate in the making of vital decisions concerning the classification, treatment, prognosis, and disposition of individuals. In their attempts to increase the num- ber of correct classifications and pre- dictions, psychologists have de- veloped and applied many psycho- metric devices, such as patterns of test responses as well as cutting scores for scales, indices, and sign lists. Since diagnostic and prognostic statements can often be made with a high degree of accuracy purely on the basis of actuarial or experience tables (referred to hereinafter as base rates), a psychometric device, to be efficient, must make possible a greater number of correct decisions than could be made in terms of the base rates alone. The efficiency of the great majority of psychometric devices reported in the clinical psychology literature is difficult or impossible to evaluate for the following reasons: a. Base rates are virtually never reported. It is, therefore, difficult to determine whether or not a given device results in a greater number of correct decisions than would be possi- ble solely on the basis of the rates from previous experience. When, 1 From the Neuropsychiatric Service, VA Hospital, Minneapolis, Minnesota, and the Divisions of Psychiatry and Clinical Psy- chology of the University of Minnesota Medical School. The senior author carried on his part of this work in connection with his appointment to the Minnesota Center for the Philosophy of Science. 194 however, the base rates can be esti- mated, the reported claims of efficien- cy of psychometric instruments are often seen to be without foundation. b. In most reports, the distribution data provided are insufficient for the evaluation of the probable effi- ciency of the device in other settings where the base rates are markedly different. Moreover, the samples are almost always too small for the determination of optimal cutting lines for various decisions. c. Most psychometric devices are reported without cross-validation data. If a psychometric instrument is applied solely to the criterion groups from which it was developed, its reported validity and efficiency are likely to be spuriously high, especial- ly if the criterion groups are small. d. There is often a lack of clarity concerning the type of population in which a psychometric device can be effectively applied. e. Results are frequently reported only in terms of significance tests for differences between groups rather than in terms of the number of cor- rect decisions for individuals within the groups. The purposes of this paper are to examine current methodology in stud- ies of predictive and concurrent validity (1), and to present some methods for the evaluation of the efficiency of psychometric devices as well as for the improvement in the interpretations made from such de- vices. Actual studies reported in the

ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

PSYCHOLOGICAL BULLETINVol. 52, No. 3, 1955

ANTECEDENT PROBABILITY AND THE EFFICIENCY OFPSYCHOMETRIC SIGNS, PATTERNS,

OR CUTTING SCORESPAUL E. MEEHL

University of Minnesota

AND ALBERT ROSENVA Hospital, Minneapolis, and University of Minnesota1

In clinical practice, psychologistsfrequently participate in the makingof vital decisions concerning theclassification, treatment, prognosis,and disposition of individuals. Intheir attempts to increase the num-ber of correct classifications and pre-dictions, psychologists have de-veloped and applied many psycho-metric devices, such as patterns oftest responses as well as cuttingscores for scales, indices, and signlists. Since diagnostic and prognosticstatements can often be made with ahigh degree of accuracy purely on thebasis of actuarial or experience tables(referred to hereinafter as base rates),a psychometric device, to be efficient,must make possible a greater numberof correct decisions than could bemade in terms of the base rates alone.

The efficiency of the great majorityof psychometric devices reported inthe clinical psychology literature isdifficult or impossible to evaluate forthe following reasons:

a. Base rates are virtually neverreported. It is, therefore, difficultto determine whether or not a givendevice results in a greater number ofcorrect decisions than would be possi-ble solely on the basis of the ratesfrom previous experience. When,

1 From the Neuropsychiatric Service, VAHospital, Minneapolis, Minnesota, and theDivisions of Psychiatry and Clinical Psy-chology of the University of MinnesotaMedical School. The senior author carriedon his part of this work in connection with hisappointment to the Minnesota Center for thePhilosophy of Science.

194

however, the base rates can be esti-mated, the reported claims of efficien-cy of psychometric instruments areoften seen to be without foundation.

b. In most reports, the distributiondata provided are insufficient forthe evaluation of the probable effi-ciency of the device in other settingswhere the base rates are markedlydifferent. Moreover, the samplesare almost always too small for thedetermination of optimal cuttinglines for various decisions.

c. Most psychometric devices arereported without cross-validationdata. If a psychometric instrument isapplied solely to the criterion groupsfrom which it was developed, itsreported validity and efficiency arelikely to be spuriously high, especial-ly if the criterion groups are small.

d. There is often a lack of clarityconcerning the type of population inwhich a psychometric device can beeffectively applied.

e. Results are frequently reportedonly in terms of significance testsfor differences between groups ratherthan in terms of the number of cor-rect decisions for individuals withinthe groups.

The purposes of this paper are toexamine current methodology in stud-ies of predictive and concurrentvalidity (1), and to present somemethods for the evaluation of theefficiency of psychometric devices aswell as for the improvement in theinterpretations made from such de-vices. Actual studies reported in the

Page 2: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 195

literature will be used for illustrationwherever possible. It should beemphasized that these particular il-lustrative studies of common prac-tices were chosen simply because theycontained more complete data thanare commonly reported, and wereavailable in fairly recent publications.

IMPORTANCE OF BASE RATES

Danielson and Clark (4) have re-ported on the construction and appli-cation of a personality inventorywhich was devised for use in militaryinduction stations as an aid in detect-ing those men who would not com-plete basic training because of psy-chiatric disability or AWOL recidi-vism. One serious defect in theirarticle is that it reports cutting lineswhich have not been cross validated.Danielson and Clark state that in-ductees were administered the FortOrd Inventory within two days afterinduction into the Army, and thatall of these men were allowed to un-dergo basic training regardless of theirtest scores.

Two samples (among others) ofthese inductees were selected for thestudy of predictive validity: (a) Agroup of 415 men who had made agood adjustment (Good AdjustmentGroup), and (&) a group of 89 menwho were unable to complete basictraining and who were sufficientlydisturbed to warrant a recommenda-tion for discharge by a psychiatrist(Poor Adjustment Group). The au-thors state that "the most importanttask of a test designed to screen outmisfits is the detection of the (latter)group" (4, p. 139). The authorsfound that their most effective scalefor this differentiation picked up, ata given cutting point, 55% of thePoor Adjustment Group (valid posi-tives) and 19% of the Good Adjust-ment Group (false positives). Theoverlap between these two groups

would undoubtedly have been greaterif the cutting line had been crossvalidated on a random sample fromthe entire population of inductees, butfor the purposes of the present dis-cussion, let us assume that the re-sults were obtained from cross-vali-dation groups. There is no mentionof the percentage of all inductees whofall into the Poor Adjustment Group,but a rough estimate will be adequatefor the present discussion. Supposethat in their population of soldiers,as many as 5% make a poor adjust-ment and 95% make a good adjust-ment. The results for 10,000 caseswould be as depicted in Table 1.

TABLE 1NUMBER OF INDUCTEES IN THE POOR ADJUST-MENT AND GOOD ADJUSTMENT GROUPS

DETECTED BY A SCREENINGINVENTORY

(55% valid positives; 19% false positives)

Actual Adjustment

Predicted Poor Good

No. % No. %

TV.t!>1

Pre-

Poor 275 55 1,805 19 2,080Good 225 45 7,695 81 7,920Total actual 500 100 9,500 100 10,000

Efficiency in detecting poor adjust-ment cases. The efficiency of thescale can be evaluated in severalways. From the data in Table 1it can be seen that if the cutting linegiven by the authors were used atFort Ord, the scale could not be useddirectly to "screen out misfits." Ifall those predicted by the scale tomake a poor adjustment were screenedout, the number of false positiveswould be extremely high. Among the10,000 potential inductees, 2080would be predicted to make a pooradjustment. Of these 2080, only 275,or 13%, would actually make a pooradjustment, whereas the decisions

Page 3: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

196 PAUL E. MEEHL AND ALBERT ROSEN

for 1805 men, or 87% of thosescreened out, would be incorrect.

Efficiency in prediction for all cases.If a prediction were made for everyman on the basis of the cutting linegiven for the test, 275+7695, or7970, out of 10,000 decisions would becorrect. Without the test, however,every man would be predicted tomake a good adjustment, and 9500of the predictions would be correct.Thus, use of the test has yielded adrop from 95% to 79.7% in the totalnumber of correct decisions.

Efficiency in detecting good adjust-ment cases. There is one kind of de-cision in which the Inventory canimprove on the base rates, however.If only those men are accepted whoare predicted by the Inventory tomake a good adjustment, 7920 will beselected, and the outcome of 7695of the 7920, or 97%, will be pre-dicted correctly. This is a 2% in-crease in hits among predictions of"success." The decision as to whetheror not the scale improves on the baserates sufficiently to warrant its usewill depend on the cost of administer-ing the testing program, the adminis-trative feasibility of rejecting 21%of the men who passed the psychiatricscreening, the cost to the Army oftraining the 225 maladaptive recruits,and the intangible human costs in-volved in psychiatric breakdown.

Populations to which the scale isapplied. In the evaluation of theefficiency of any psychometric in-strument, careful consideration mustbe given to the types of populationsto which the device is to be applied.Danielson and Clark have statedthat "since the final decision as todisposition is made by the psychia-trist, the test should be classified asa screening adjunct" (4, p. 138).This statement needs clarification,however, for the efficiency of thescale can vary markedly according

to the different ways in which itmight be used as an adjunct.

It will be noted that the test wasadministered to men who were al-ready in the Army, and not to menbeing examined for induction. Thereported validation data apply,therefore, specifically to the popula-tion of recent inductees. The resultsmight have been somewhat differentif the population tested consisted ofpotential inductees. For the sake ofillustration, however, let us assumethat there is no difference in the testresults of the two populations.

An induction station psychiatristcan use the scale cutting score inone or more of the following ways,i.e., he can apply the scale results to avariety of populations, (a) The psy-chiatrist's final decision to accept orreject a potential inductee may bebased on both the test score and hisusual interview procedure. The popu-lation to which the test scores areapplied is, therefore, potential in-ductees interviewed by the usual pro-cedures for whom no decision was made,(b) He may evaluate the potentialinductee according to his usual pro-cedures, and then consult the testscore only if the tentative decisionis to reject. That is, a decision toaccept is final. The population towhich the test scores are applied ispotential inductees tentatively rejectedby the usual interview procedures, (c)An alternative procedure is for thepsychiatrist to consult the test scoreonly if the tentative decision is toaccept, the population being potentialinductees tentatively accepted by theusual interview procedures. The de-cision to reject is final, (d) Probablythe commonest proposal for the useof tests as screening adjuncts is thatthe more skilled and costly psychia-tric evaluation should be made onlyupon the test positives, i.e., induc-tees classified by the test as good

Page 4: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 197

risks are not interviewed, or are sub-jected only to a very short and super-ficial interview. Here the populationis all potential inductees, the test beingused to make either a final decisionto "accept" or a decision to "exam-ine."

Among these different procedures,how is the psychiatrist to achievemaximum effectiveness in using thetest as an adjunct? There is no an-swer to this question from the avail-able data, but it can be stated defi-nitely that the data reported byDanielson and Clark apply only tothe third procedure described above.The test results are based on a se-lected group of men accepted for in-duction and not on a random sampleof potential inductees. If the scaleis used in any other way than thethird procedure mentioned above, theresults may be considerably inferiorto those reported, and, thus, to theuse of the base rates without thetest.2

The principles discussed thus far,although illustrated by a single study,can be generalized to any study ofpredictive or concurrent validity. Itcan be seen that many considerationsare involved in determining theefficiency of a scale at a given cut-ting score, especially the base rates ofthe subclasses within the populationto which the psychometric deviceis to be applied. In a subsequentportion of this paper, methods willbe presented for determining cuttingpoints for maximizing the efficiencyof the different types of decisionswhich are made with psychometricdevices.

Another study will be utilized toillustrate the importance of an explicitstatement of the base rates of popu-

1 Goodman (8) has discussed this sameproblem with reference to the supplementaryuse of an index for the prediction of paroleviolation.

lation subgroups to be tested with agiven device. Employing an interest-ing configural approach, Thiesen(18) discovered five Rorschach pat-terns, each of which differentiatedwell between 60 schizophrenic adultpatients and a sample of 157 gainfullyemployed adults. The best differen-tiator, considering individual pat-terns or number of patterns, wasPattern A, which was found in 20%of the patients' records and in only.6% of the records of normals. Thie-sen concludes that if these patternsstand the test of cross validation,they might have "clinical usefulness"in early detection of a schizophrenicprocess or as an aid to determiningthe gravity of an initial psychoticepisode (18, p. 369). If by "clinicalusefulness" is meant efficiency in aclinic or hospital for the diagnosis ofschizophrenia, it is necessary todemonstrate that the patterns dif-ferentiate a higher percentage ofschizophrenic patients from otherdiagnostic groups than could be cor-rectly classified without any testat all, i.e., solely on the basis of therates of various diagnoses in anygiven hospital. If a test is to be usedin differential diagnosis among psy-chiatric patients, evidence of itsefficiency for this function cannot beestablished solely on the basis of dis-crimination of diagnostic groups fromnormals. If by "clinical usefulness"Thiesen means that his data indicatethat the patterns might be used todetect an early schizophrenic processamong nonhospitalized gainfully em-ployed adults, he would do better todiscard his patterns and use the baserates, as can be seen from the follow-ing data.

Taulbee and Sisson (17) cross vali-dated Thiesen's patterns on schizo-phrenic patient and normal samples,and found that Pattern A was thebest discriminator. Among patients,

Page 5: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

198 PAUL E. MEEHL AND ALBERT ROSEN

8.1% demonstrated this pattern andamong normals, none had this pat-tern. There are approximately 60million gainfully employed adults inthis country, and it has been esti-mated that the rate of schizophreniain the general population is approxi-mately .85% (2, p. 558). The resultsfor Pattern A among a populationof 10,000 gainfully employed adultswould be as shown in Table 2. Inorder to detect 7 schizophrenics, itwould be necessary to test 10,000 indi-viduals.

TABLE 2NUMBER OF PERSONS CLASSIFIED AS SCHIZO-PHRENIC AND NORMAL BY A TEST PATTERN

AMONG A POPULATION OF GAINFULLYEMPLOYED ADULTS

(8.1% valid positives; 0.0% false positives)

Classifica-tion byTest

Schizo-phrenia

NormalTotal in

class

Criterion Classification

Schizo-phrenia

No.

778

81

%

8.191.9

100

Normal

No.

09,915

9,915

%

0100

100

TotalClassi-fied by

Test

79,993

10,000

In the Neurology service of a hospi-tal a psychometric scale is used whichis designed to differentiate betweenpatients with psychogenic and organ-ic low back pain (9). At a given cut-ting point, this scale was found toclassify each group with approxi-mately 70% effectiveness upon crossvalidation, i.e., 70% of cases withno organic findings scored above anoptimal cutting score, and 70% ofsurgically verified organic casesscored below this line. Assume that90% of all patients in the Neurologyservice with a primary complaint oflow back pain are in fact "organic."Without any scale at all the psychol-

ogist can say every case is organic,and be right 90% of the time. Withthe scale the results would be asshown in Section A of Table 3. Of10 psychogenic cases, 7 score abovethe line; of 90 organic cases, 63 scorebelow the cutting line. If every caseabove the line is called psychogenic,only 7 of 34 will be classified correctlyor about 21%. Nobody wants to beright only one out of five times in thistype of situation, so that it is ob-vious that it would be imprudent tocall a patient psychogenic on the basisof this scale. Radically differentresults occur in prediction for casesbelow the cutting line. Of 66 cases63, or 95%, are correctly classifiedas organic. Now the psychologisthas increased his diagnostic hits from90 to 95% on the condition that helabels only cases falling below theline, and ignores the 34% scoringabove the line.

TABLE 3NUMBER OF PATIENTS CLASSIFIED AS PSYCHO-GENIC AND ORGANIC ON A Low BACK PAIN

SCALE WHICH CLASSIFIES CORRECTLY70% OF PSYCHOGENIC AND ORGANIC

CASES

Actual Diagnosis TotalClassified

bygenie ~° Scale

Classificationby Scale Psycho- Organ;c

A, Base Rates in Population Tested:90% Organic; 10% Psychogenic

PsychogenicOrganicTotal diagnosed

73

10

276390

3466

100

B. Base Rates in Population Tested:90% Psychogenic; 10% Organic

Psychogenic 63Organic 27Total diagnosed 90

3 667 34

10 100

In actual practice, the psychologistmay not, and most likely will not,

Page 6: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 199

test every low back pain case. Prob-ably those referred for testing will bea select group, i.e., those who theneurologist believes are psychogenicbecause neurological findings are mini-mal or absent. This fact changes thepopulation from "all patients inNeurology with a primary complaintof low back pain," to "all patientsin Neurology with a primary com-plaint of low back pain who are re-ferred for testing." Suppose that aa study of past diagnoses indicatedthat of patients with minimal or ab-sent findings, 90% were diagnosed aspsychogenic and 10% as organic.Section B of Table 3 gives an entirelydifferent picture of the effectivenessof the low back pain scale, and newlimitations on interpretation are ne-cessary. Now the scale correctlyclassifies 95% of all cases above theline as psychogenic (63 of 66), andis correct in only 21% of all casesbelow the line (7 of 34). In thispractical situation the psychologistwould be wise to refrain from inter-preting a low score.

From the above illustrations itcan be seen that the psychologist ininterpreting a test and in evaluatingits effectiveness must be very muchaware of the population and itssubclasses and the base rates of thebehavior or event with which he isdealing at any given time.

It may be objected that no clini-cian relies on just one scale but woulddiagnose on the basis of a configura-tion of impressions from severaltests, clinical data and history. Wemust, therefore, emphasize that thepreceding single-scale examples werepresented for simplicity only, but thatthe main point is not dependent uponthis "atomism." Any complex con-figurational procedure in any numberof variables, psychometric or otherwise,eventuates in a decision. Those de-cisions have a certain objective suc-

cess rate in criterion case identifica-tion; and for present purposes wesimply treat the decision function,whatever its components and com-plexity may be, as a single variable.It should be remembered that theliterature does not present us withcross-validated methods having hitrates much above those we havechosen as examples, regardless ofhow complex or configural the meth-ods used. So that even if the clinicianapproximates an extremely complexconfigural function "in his head"before classifying the patient, forpurposes of the present problem thiscomplex function is treated as thescale. In connection with the moregeneral "philosophy" of clinical de-cision making see Bross (3) andMeehl(12).

APPLICATIONS OF BAYES'THEOREM

Many readers will recognize thepreceding numerical examples asessentially involving a principle ofelementary probability theory, theso-called "Bayes" Theorem." Whileit has come in for some opprobriumon account of its connection withcertain pre-Fisherian fallacies in sta-tistical inference, as an algebraicstatement the theorem has, of course,nothing intrinsically wrong with itand it does apply in the present case.One form of it may be stated as fol-lows:

If there are k antecedent condi-tions under which an event of a givenkind may occur, these conditions hav-ing the antecedent probabilities Pi,P», • • ' , Pt of being realized, andthe probability of the event uponeach of them is pi, pi, pi, • • • , pk',then, given that the event is observedto occur, the probability that itarose on the basis of a specified one,say j, of the antecedent conditionsis given by

Page 7: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

200 PAUL E. MEEHL AND ALBERT ROSEN

PH.

The usual illustration is the caseof drawing marbles from an urn.Suppose we have two urns, and theurn-selection procedure is such thatthe probability of our choosing thefirst urn is 1/10 and the second 9/10.Assume that 70% of the marblesin the first urn are black, and 40% ofthose in the second urn are black. Inow (blindfolded) "choose" an urnand then, from it, I choose a marble.The marble turns out to be black.What is the probability that I drewfrom the first urn?

i =.10 PS=.90

pt =.40

Then

Pi (6)=- ..163.

If I make a practice of inferring undersuch circumstances that an observedblack marble arose from the firsturn, I shall be correct in such judg-ments, in the long run, only 16.3%of the time. Note, however, that the"test item" or "sign" black marbleis correctly "scored" in favor of UrnNo. 1, since there is a 30% differencein black marble rate between it andUrn No. 2. But this considerabledisparity in symptom rate is over-come by the very low base rate("antecedent probability of choosingfrom the first urn"), so that inferenceto first-urn origin of black marbleswill actually be wrong some 84 timesin 100. In the clinical analogue, theurns are identified with the subpopu-lations of patients to be discriminated(their antecedent probabilities beingequated to their base rates in thepopulation to be examined), and theblack marbles are test results of a

certain ("positive") kind. The pro-portion of black marbles in one urnis the valid positive rate, and in theother is the false positive rate. In-spection and suitable manipulationsof the formula for the common two-category case, viz.,

P(o) —Ppi+Qpi

Pd«» = Probability that an individ-ual is diseased, given thathis observed test score ispositive

P = Base rate of actual positivesin the population examined

pi = Proportion of diseased iden-tified by test ("valid posi-tive" rate)

qi=l-pipi = Proportion of nondiseased

misidentified by test as beingdiseased ("false positive"rate)

yields several useful statements. Notethat in what follows we are operatingentirely with exact population param-eter values; i.e., sampling errorsare not responsible for the dangersand restrictions set forth. See Table4.

1. In order for a positive diagnosticassertion to be "more likely true thanfalse," the ratio of the positive to thenegative base rates in the examinedpopulation must exceed the ratio ofthe false positive rate to the validpositive rate. That is,

Q PiIf this condition is not met, the

attribution of pathology on the basisof the test is more probably in errorthan correct, even though the signbeing used is valid (i.e.,

Page 8: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 201

TABLE 4

DEFINITION OF SYMBOLS

DiagnosisfromTest Positive

Actual Diagnosis

Negative

Pi PiPositive Valid positive False positive

rate (Proportion rate (Proportionof positives of negativescalled positive) called positive)

Negative False negative Valid negativerate (Proportion rate (Proportionof positives of negativescalled negative) called negative)

Total £i+gi = 1.0 £2+32 = 1.0withactu- (Total posi- (Total nega-al diag- tives) tives)

Note.—For simplicity, the term "diagnosis" is usedto denote the classification of any kind of pathology, be-havior, or event being studied, or to denote "outcome"if a test is used for prediction. Since horizontal addition(e.g., pi-\-pi) is meaningless in ignorance of the baserates, there is no symbol or marginal total for thesesums. All values are parameter values.

Example: If a certain cutting scoreidentifies 80% of patients with organ-ic brain damage (high scores beingindicative of damage) but is also ex-ceeded by 15% of the nondamagedsent for evaluation, in order for thepsychometric decision "brain dam-age present" to be more often truethan false, the ratio of actually brain-damaged to nondamaged cases amongall seen for testing must be at leastone to five (.19).

Piotrowski has recommended thatthe presence of 5 or more Rorschachsigns among 10 "organic" signs is anefficient indicator of brain damage.Dorken and Krai (5), in cross validat-ing Piotrowski's index, found that63% of organics and 30% of a mixed,nonorganic, psychiatric patient grouphad Rorschachs with 5 or moresigns. Thus, our estimate of pi/pi= .30/.63 = .48, and in order for thedecision "brain damage present" tobe correct more than one-half the

time, the proportion of positives (P)in a given population must exceed.33 (i.e., P/<?>.33/.67). Since fewclinical populations requiring thisclinical decision would have such ahigh rate of brain damage, especiallyamong psychiatric patients, the par-ticular cutting score advocated byPiotrowski will produce an excessivenumber of false positives, and thepositive diagnosis will be more oftenwrong than right. Inasmuch as thebase rates for any given behavior orpathology differ from one clinicalsetting to another, an inflexible cut-ting score should not be advocated forany psychometric device. This state-ment applies generally—thus, toindices recommended for such di-verse purposes as the classificationor detection of deterioration, specificsymptoms, "traits," neuroticism, sex-ual aberration, dissimulation, sui-cide risk, and the like. When P issmall, it may be advisable to explorethe possibility of dealing with a re-stricted population within which thebase rate of the attribute being testedis higher. This approach is discussedin an article by Rosen (14) on thedetection of suicidal patients in whichit is suggested that an attempt mightbe made to apply an index to sub-populations with higher suicide rates.

2. If the base rates are equal, theprobability of a positive diagnosisbeing correct is the ratio of validpositive rate to the sum of validand false positive rates. That is,

if

Example: If our population isevenly divided between neurotic andpsychotic patients the condition forbeing "probably right" in diagnosingpsychosis by a certain method issimply that the psychotics exhibitthe pattern in question more fre-

Page 9: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

202 PAUL E. MEEHL AND ALBERT ROSEN

quently than the neurotics. This isthe intuitively obvious special case;it is often misgeneralized to justifyuse of the test in those cases wherebase-rate asymmetry (P^Q) coun-teracts the (pi — pz) discrepancy, lead-ing to the paradoxical consequencethat deciding on the basis of more in-formation can actually worsen thechances of a correct decision. Theapparent absurdity of such an ideahas often misled psychologists intobehaving as though the establish-ment of "validity" or "discrimina-tion," i.e., that pi^p^, indicatesthat a procedure should be used indecision making.

Example: A certain test is usedto select those who will continue inoutpatient psychotherapy (positives).It correctly identifies 75% of thesegood cases but the same cuttingscore picks up 40% of the poor riskswho subsequently terminate againstadvice. Suppose that in the pastexperience of the clinic 50% of thepatients terminated therapy pre-maturely. Correct selection of pa-tients can be made with the givencutting score on the test 65% of thetime, since pi/(pi+pz)=.75/(.75+.40) = .65. It can be seen that theefficiency of the test would be exag-gerated if the base rate for continua-tion in therapy were actually .70,but the efficiency were evaluatedsolely on the basis of a research studycontaining equal groups of continuersand noncontinuers, i.e., if it wereassumed that P = .50.

3. In order for the hits in the en-tire population which is under con-sideration to be increased by use ofthe test, the base rate of the morenumerous class (called here positive)must be less than the ratio of thevalid negative rate to the sum ofvalid negative and false negativerates. That is, unless

the making of decisions on the basisof the test will have an adverseeffect. An alternative expression isthat (P/Q) <(qt/qi) when P>Q, i.e.,the ratio of the larger to the smallerclass must be less than the ratio ofthe valid negative rate to the falsenegative rate. When P<Q, the con-ditions for the test to improve uponthe base rates are:

Q<-

and

P p*

Rotter, Rafferty, and Lotsof (15)have reported the scores on a sentencecompletion test for a group of 33"maladjusted" and 33 "adjusted"girls. They report that the use of aspecified cutting score (not crossvalidated) will result in the correctclassification of 85% of the malad-justed girls and the incorrect classifi-cation of only 15% of the adjustedgirls. It is impossible to evaluateadequately the efficiency of the testunless one knows the base rates ofmaladjustment (P) and adjustment((?) for the population of high schoolgirls, although there would be generalagreement that Q>P. Since pi/(pi+£2) = .85/(.85 + .15)=.85, the over-all hits in diagnosis with the testwill not improve on classificationbased solely on the base rates unlessthe proportion of adjusted girls isless than .85. Because the reportedeffectiveness of the test is spuriouslyhigh, the proportion of adjustedgirls would no doubt have to beconsiderably less than .85. Unlessthere is good reason to believe that

Page 10: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 203

the base rates are similar from onesetting to another, it is impossible todetermine the efficiency of a testsuch as Rotter's when the criterion isbased on ratings unless one replicateshis research, including the criterionratings, with a representative sampleof each new population.

4. In altering a sign, improving ascale, or shifting a cutting score, theincrement in valid positives per incre-ment in valid positive rate is propor-tional to the positive base rate; andanalogously, the increment in validnegatives per increment in validnegative rate is proportional to thenegative base rate. That is, if wealter a sign the net improvement inover-all hit rate is

where HT — original proportion ofhits (over-all) and H'T = new propor-tion of hits (over-all).

5. A corollary of this is that alter-ing a sign or shifting a cut will im-prove our decision making if, andonly if, the ratio of improvement,kpi in valid positive rate to worseningAp2 in false negative rate exceeds theratio of actual negatives to positivesin the population.

Example: Suppose we improve theintrinsic validity of a certain "schizo-phrenic index" so that it now de-tects 20% more schizophrenics thanit formerly did, at the expense ofonly a 5% increase in the false posi-tive rate. This surely looks encourag-ing. We are, however, working withan outpatient clientele only l/10thof whom are actually schizophrenic.Then, since

(?=.90A/>2=.05

applying the formula we see that

JJO J»

1)5 lo

i.e., the required inequality does nothold, and the routine use of this"improved" index will result in anincrease in the proportion of errone-ous diagnostic decisions.

In the case of any pair of unimodaldistributions, this corresponds to theprinciple that the optimal cut liesat the intersection of the two distribu-tion envelopes (11, pp. 271-272).

MANIPULATION OF CUTTINGLINES FOR DIFFERENT

DECISIONSFor any given psychometric de-

vice, no one cutting line is maximallyefficient for clinical settings in whichthe base rates of the criterion groupsin the population are different. Fur-thermore, different cutting lines maybe necessary for various decisionswithin the same population. In thissection, methods are presented formanipulating the cutting line of anyinstrument in order to maximizethe efficiency of a device in the mak-ing of several kinds of decisions.Reference should be made to thescheme presented in Table 5 for un-derstanding of the discussion whichfollows. This scheme and the meth-ods for manipulating cutting linesare derived from Duncan, Ohlin,Reiss, and Stanton (6).

A study in the prediction of juve-nile delinquency by Glueck andGlueck(7) will be used for illustration.Scores on a prediction index for 451delinquents and 439 nondelinquents(7, p. 261) are listed in Table 6. Ifthe Gluecks' index is to be used in apopulation with a given juveniledelinquency rate, cutting lines canbe established to maximize the effi-ciency of the index for several de-

Page 11: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

204 PAUL E. MEEHL AND ALBERT ROSEN

TABLE S

SYMBOLS TO BE USED IN EVALUATING THE EFFICIENCY OF A PSYCHOMETRIC DEVICEIN CLASSIFICATION OR PREDICTION

DiagnosisfromTest

Actual Diagnosis

Positive Negative

Total DiagnosedfromTest

Positive (Number of valid posi-tives)

NQp,(Number of false posi-

tives)(Number of test posi-

tives)

Negative (Number of false nega-tives)

NQg,(Number of valid nega-

tives)

NPqi+NQqi(Number of test nega-

tives)

Total with actual NP NQ Ndiagnosis (Number of actual pos- (Number of actual neg- (Total number of cases)

itives) atives)

Note.—For simplicity, the term "diagnosis" is used to denote the classification of any kind of pathology, be-havior, or event studied, or to denote "outcome" if a test is used for prediction. "Number" means absolute frequency,not rate or probability.

cisions. In the following illustration,a delinquency rate of .20 will be used.From the data in Table 6, optimalcutting lines will be determined formaximizing the proportion of correctpredictions, or hits, for all cases(Hr), and for maximizing the pro-portion of hits (Hp) among thosecalled delinquent (positives) by theindex.

In the first three columns of Table6, "/" denotes the number of de-linquents scoring in each class inter-val, "cf" represents the cumulative

frequency of delinquents scoringabove each class interval (e.g., 265score above 299), and pi representsthe proportion of the total groupof 451 delinquents scoring aboveeach class interval. Columns 4, 5,and 6 present the same kind of datafor the 439 nondelinquents.

Maximizing the number of correctpredictions or classifications for allcases. The proportion of correctpredictions or classifications (Hr) forany given cutting line is given by theformula, Hr = Ppi-\-Qq^. Thus, in

TABLE 6

PREDICTION INDEX SCORES FOR JUVENILE DELINQUENTS AND NONDELINQUENTS ANDOTHER STATISTICS FOR DETERMINING OPTIMAL CUTTING LINES FOR CERTAIN

DECISIONS IN A POPULATION WITH A DELINQUENCY RATE OF .20*

Delinquents

PredictionIndexScore

400+350-399300-349250-299200-249150-199

<150

(1)

/

5173

14112240195

(2)

cf

51124265387427446451

cf

451

(3)

f t

.1131

.2749

.5876

.8581

.9468

.98891 .0000

Nondelinquents

(4)

/

18

237068

102167

(5)

cf

19

32102170272439

cf

439

(6)

t,

.0023

.0205

.0729

.2323

.3872

.61961.0000

1 -P.

m«.

.9977

.9795

.9271

.7677

.6128

.3804

.0000

.2*

(8)

P#.

.0226

.0550.1175.1716.1894.1978.2000

.8,,

(9)

Qtt,0018.0164.0583.1858.3098.4957.8000

.84 1

(10)

QQ,.7982.7836.7417.6142.4902.3043.0000

67,"(11)

HT.821.839.859.786.680.502.200

H a7,'+

(12)

Rp.024.071.176.357.499.694

1.000

P*iRp

(13)

HP

.926

.770

.668

.480

.379

.285

.200

* Frequencies in columns 1 and 4 are from Glueck and Glueck (7, p. 261).

Page 12: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 205

column 11 of Table 6, labelled HT,it can be seen that the best cuttingline for this decision would be be-tween 299 and 300, for 85.9% of allpredictions would be correct if thoseabove the line were predicted tobecome delinquent and all those be-low the line nondelinquent. Anyother cutting line would result in asmaller proportion of correct predic-tions, and, in fact, any cutting lineset lower than this point would makethe index inferior to the use of thebase rates, for if all cases were pre-dicted to be nondelinquent, the totalproportion of hits would be .80.

Maximizing the number of correctpredictions or classifications for posi-tives. The primary use of a predictiondevice may be for selection of (a)students who will succeed in a train-ing program, (b) applicants who willsucceed in a certain job, (c) patientswho will benefit from a certain typeof therapy, etc. In the present illus-tration, the index would most likelybe used for detection of those who arelikely to become delinquents. Thus,the aim might be to maximize thenumber of hits only within thegroup predicted by the index to be-come delinquents (predicted posi-tives = NPpi+NQp<i). The propor-tion of correct predictions for thisgroup by the use of different cuttinglines is given in column 13, labelledHp. Thus, if a cutting line is setbetween 399 and 400, one will becorrect over 92 times in 100 if pre-dictions are made only for personsscoring above the cutting line. Theformula for determining the efficiencyof the test when only positive predic-tions are made is Hp — Pp\/(Ppi

tempted, the proportion of thisselected group in the total samplemay be considered as a selectionratio. The selection ratio for positivesis Rp = Ppi + Qp2, that is, predictionsare made only for those above thecutting line. The selection ratio foreach posssible cutting line is shownin column 12 of Table 6, labelledRP. It can be seen that to obtainmaximum accuracy in selection ofdelinquents (92.6%), predictions canbe made for only 2.4% of the popula-tion. For other cutting lines, theaccuracy of selection and the cor-responding selection ratios are givenin Table 6. The worker applying theindex must use his own judgment indeciding upon the level of accuracyand the selection ratio desired.

Maximizing the number of correctpredictions or classifications for nega-tives. In some selection problems, thegoal is the selection of negativesrather than positives. Then, the pro-portion of hits among all predictednegative for any given cutting lineis HN = Q&/(Q&+P(li) , and the selec-tion ratio for negatives is Rtf — Pqi

One has to pay a price for achievinga very high level of accuracy withthe index. Since the problem is toselect potential delinquents so thatsome sort of therapy can be at-

In all of the above manipulations ofcutting lines, it is essential that therebe a large number of cases. Other-wise, the percentages about any givencutting line would be so unstable thatvery dissimilar results would be ob-tained on new samples. For moststudies in clinical psychology, there-fore, it would be necessary to estab-lish cutting lines according to thedecisions and methods discussedabove, and then to cross validate aspecific cutting line on new samples.

The amount of shrinkage to beexpected in the cross validation ofcutting lines cannot be determineduntil a thorough mathematical andstatistical study of the subject ismade. It may be found that whencriterion distributions are approxi-

Page 13: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

206 PAUL E. MEEHL AND ALBERT ROSEN

TABLE 7PERCENTAGE OF DELINQUENTS (D) AND NONDELINQUENTS (ND) IN EACH PREDICTIONINDEX SCORE INTERVAL IN A POPULATION IN WHICH THE DELINQUENCY RATE Is .20*

PredictionIndexScore

Interval

400+350-399300-349250-299200-249150-199

<150

No. ofD

5173

14112240195

No. ofND

43395

288279419686

Total ofD and ND

55106236410319438691

%of Din ScoreInterval

92.768.959.729.812.54.3

.7

% of NDin ScoreInterval

7.331.140.370.287.595.799.3

%of Dand NDin ScoreInterval

100100100100100100100

Total 451 1804 2255

* Modification of Table XX-2, p. 261, from Glueck and Glueck (7).

mately normal and large, cuttinglines should be established in termsof the normal probability tablerather than on the basis of the ob-served p and q values found in thesamples. In a later section dealingwith the selection ratio we shall seethat it is sometimes the best proced-ure to select all individuals fallingabove a certain cutting line and toselect the others needed to reach theselection ratio by choosing at randombelow the line; or in other cases toestablish several different cuts de-nning ranges within which one orthe opposite decision should bemade.

Decisions based on score intervalsrather than cutting lines. The Gluecks'data can be used to illustrate anotherapproach to psychometric classifica-tion and prediction when scores forlarge samples are available with arelatively large number of cases ineach score interval. In Table 7 arelisted frequencies of delinquents andnondelinquents for prediction indexscore intervals. The frequencies fordelinquents are the same as those inTable 6, whereas those for nondelin-quents have been corrected for abase rate of .20 by multiplying each

frequency in column 4 of Table 6s

by

4.11 =(.80) (459)

= (.20) (431)

Table 7 indicates the proportion ofdelinquents and nondelinquentsamong all juveniles who fall withina given score interval when thebase rate of delinquency is .20. Itcan be predicted that of those scoring400 or more, 92.7% will become de-linquent, of those scoring between 350and 399, 68.9% will be delinquent,etc. Likewise, of those scoring be-tween 200 and 249, it can be predictedthat 87.5% will not become delin-quent. Since 80% of predictions willbe correct without the index if allcases are called nondelinquent, onewould not predict nondelinquencywith the index in score intervals over249. Likewise, it would be best notto predict delinquency for individuals

« The Gluecks' Tables XX-2, 3, 4, 5, (7,pp. 261-262) and their interpretations there-from are apt to be misleading because of theirexclusive consideration of approximatelyequal base rates of delinquency and non-delinquency. Reiss (13), in his review of theGluecks' study, has also discussed their useof an unrepresentative rate of delinquency.

Page 14: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 207

in the intervals under 250 because20% of predictions will be correct ifthe base rate is used.

It should be emphasized that thereare different ways of quantifyingone's clinical errors, and they will,of course, not all give the same evalu-ation when applied in a given setting."Per cent valid positives" ( = pi) israrely if ever meaningful without thecorrelated "per cent false positives"(=£2), and clinicians are accustomedto the idea that we pay for an in-crease in the first by an increase inthe second, whenever the increase isachieved not by an improvement inthe test's intrinsic validity but by ashifting of the cutting score. But thetwo quantities pi and fa do not de-fine our over-all hit frequency, whichdepends also upon the base ratesP and Q. The three quantities p\,pz, and P do, however, contain allthe information needed to evaluatethe test with respect to any givensign or cutting score that yields thesevalues. Although pit p^, and P con-tain the relevant information, otherforms of it may be of greater impor-tance. No two of these numbers, forexample, answer the obvious ques-tion most commonly asked (or vague-ly implied) by psychiatrists when aninference is made from a sign, viz.,"How sure can you be on the basisof that sign?" The answer to thiseminently practical query involvesa probability different from any ofthe above, namely, the inverse prob-ability given by Bayes' formula:

Ppi+Qpi

Even a small improvement in thehit frequency to H'T — Pp\ + Qqi overthe HT = P attainable without thetest may be adjudged as worth whilewhen the increment A-ffr is multi-plied by the N examined in the course

of one year and is thus seen to in-volve a dozen lives or a dozen curableschizophrenics. On the other hand,the simple fact that an actual shrink-age in total hit rate may occur seemsto be unappreciated or tacitly ignoredby a good deal of clinical practice.One must keep constantly in mindthat numerous diagnostic, prognostic,and dynamic statements can be madeabout almost all neurotic patients(e.g., "depressed," "inadequate abil-ity to relate," "sexual difficulties")or about very few patients (e.g.,"dangerous," "will act out in ther-apy," "suicidal," "will blow up intoa schizophrenia"). A psychologistwho uses a test sign that even crossvalidates at £1 = 52 = 80% to deter-mine whether "depression" is presentor absent, working in a clinical popu-lation where practically everyone isfairly depressed except a few psy-chopaths and old-fashioned hysterics,is kidding himself, the psychiatrist,and whoever foots the bill.

"SUCCESSIVE-HURDLES"APPROACH

Tests having low efficiency, orhaving moderate efficiency but ap-plied to populations having very un-balanced base rates (P4C® are some-times defended by adopting a "crudeinitial screening" frame of reference,and arguing that certain other pro-cedures (whether tests or not) canbe applied to the subset identifiedby the screener ("successive hur-dles"). There is no question that insome circumstances (e.g., militaryinduction, or industrial selection witha large labor market) this is athoroughly defensible position. How-ever, as a general rule one shouldexamine this type of justificationcritically, with the preceding con-siderations in mind. Suppose wehave a test which distinguishes brain-tumor from non-brain-tumor pa-

Page 15: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

208 PAUL E. MEBHL AND ALBERT ROSEN

tients with 75% accuracy and nodifferential bias (/'i = gr2 = .75). Un-der such circumstances the test hitrate HT is .75 regardless of the baserate. If we use the test in makingour judgments, we are correct in ourdiagnoses 75 times in 100. But sup-pose only one patient in 10 actuallyhas a brain tumor, we will drop ourover-all "success" from 90% (at-tainable by diagnosing "No tumor"in all cases) to 75%. We do, however,identify 3 out of 4 of the real braintumors, and in such a case it seemsworth the price. The "price" hastwo aspects to it: We take timeto give the test, and, having givenit, we call many "tumorous" who arenot. Thus, suppose that in the courseof a year we see 1000 patients. Ofthese, 900 are non-tumor, and weerroneously call 225 of these "tumor."To pick up (100) (.75) = 75 of thetumors, all 100 of whom would havebeen called tumor-free using thebase rates alone, we are willing tomislabel 3 times this many as tumor-ous who are actually not. Putting itanother way, whenever we say"tumor" on the basis of the test, thechances are 3 to 1 that we are mis-taken. When we "rule out" tumorby the test, we are correct 96% ofthe time, an improvement of only 6%in the confidence attachable to anegative finding over the confidenceyielded by the base rates.4

Now, picking up the successive-hurdles argument, suppose a majordecision (e.g., exploratory surgery)is allowed to rest upon a second test

4 Improvements are expressed throughoutthis article as absolute increments in percent-age of hits, because: (a) This avoids the com-plete arbitrariness involved in choosing be-tween original hit rate and miss rate as start-ing denominator; and (6) for the clinician,the person is the most meaningful unit ofgain, rather than a proportion of a proportion(especially when the reference proportion isvery small).

which is infallible but for practicallyinsuperable reasons of staff, time,etc., cannot be routinely given. Weadminister Test 2 only to "positives"on (screening) Test 1. By this tacticwe eliminate all 225 false positivesleft by Test 1, and we verify the 75valid positives screened in by Test1. The 25 tumors that slippedthrough as false negatives on Test 1are, of course, not picked up by Test2 either, because it is not applied tothem. Our total hit frequency is now97.5%, since the only cases ultimate-ly misclassified out of our 1000 seenare these 25 tumors which escapedthrough the initial sieve Test 1. Weare still running only 7§% above thebase rate. We have had to give ourshort-and-easy test to 1000 indi-viduals and our cumbersome, expen-sive test to 300 individuals, 225of whom turn out to be free of tumor.But we have located 75 patients withtumor who would not otherwisehave been found.

Such examples suggest that, ex-cept in "life-or-death" matters, thesuccessive-screenings argument mere-ly tends to soften the blow of Bayes'Rule in cases where the base rates arevery far from symmetry. Also, ifTest 2 is not assumed to be infalliblebut only highly effective, say 90%accurate both ways, results startlooking unimpressive again. Our netfalse positive rate rises from zero to22 cases miscalled "tumor," and weoperate 67 of the actual tumors in-stead of 75. The total hit frequencydrops to 94.5%, only 4|% abovethat yielded by a blind guessing ofthe modal class.

THE SELECTION RATIOStraightforward application of the

preceding principles presupposes thatthe clinical decision maker is free toadopt a policy solely on the basis ofmaximizing hit frequency. Some-

Page 16: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 209

times there are external constraintssuch as staff time, administrativepolicy, or social obligation whichfurther complicate matters. It maythen be impossible to make all de-cisions in accordance with the baserates, and the task given to the testis that of selecting a subset of caseswhich are decided in the directionopposite to the base rates but willstill contain fewer erroneous decisionsthan would ever be yielded by oppos-ing the base rates without the test.If 80% of patients referred to aMental Hygiene Clinic are recover-able with intensive psychotherapy,we would do better to treat every-body than to utilize a test yielding75% correct predictions. But sup-pose that available staff time islimited so that we can treat only halfthe referrals. The Bayes-type injunc-tion to "follow the base rates whenthey are better than the test" be-comes pragmatically meaningless, forit directs us to make decisions whichwe cannot implement. The imposi-tion of an externally imposed selec-tion ratio, not determined on thebasis of any maximizing or mini-mizing policy but by nonstatisticalconsiderations, renders the testworth while.

Prior to imposition of any arbi-trary selection ratio, the fourfoldtable for 100 referrals might be asshown in Table 8. If the aim weresimply to minimize total errors, we

TABLE 8

ACTUAL AND TEST-PREDICTEDTHERAPEUTIC OUTCOME

Test Therapeutic Outcome

diction Good Poor Total

GoodPoorTotal

602080

5IS20

6535100

would predict "good" for each caseand be right 80 times in 100. Usingthe test, we would be right only 75times in 100. But suppose a selec-tion ratio of .5 is externally imposed.We are then forced to predict "poor"for half the cases, even though this"prediction" is, in any given case,likely to be wrong. (More precisely,we handle this subset as if we pre-dicted "poor," by refusing to treat.)So we now select our 50 to-be-treatedcases from among those 65 who fallin the "test-good" array, having afrequency of 60/65 = 92.3% hitsamong those selected. This is betterthan the 80% we could expect(among those selected) by choosinghalf the total referrals at random.Of course we pay for this, by makingmany "false negative" decisions;but these are necessitated, whether weuse the test or not, by the fact thatthe selection ratio was determinedwithout regard for hit maximizationbut by external considerations. With-out the test, our false negative rateffl is 50% (i.e., 40 of the 80 "good"cases will be called "poor"); the testreduces the false negative rate to42.5% ( = 34/80), since 15 cases fromabove the cutting line must beselected at random for inclusion inthe not-to-be-treated group belowthe cutting line [i.e., 20 + (60/65)15= 34]. Stated in terms of correctdecisions, without the test 40 outof 50 selected for therapy will have agood therapeutic outcome; with thetest, 46 in 50 will be successes.

Reports of studies in which for-mulas are developed from psycho-metrics for the prediction of patients'continuance in psychotherapy haveneglected to consider the relationshipof the selection ratio to the specificpopulation to which the predictionformula is to be applied. In eachstudy the population has consistedof individuals who were accepted for

Page 17: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

210 PAUL E. MEEHL AND ALBERT ROSEN

therapy by the usual methods em-ployed at an outpatient clinic, andthe prediction formula has beenevaluated only for such patients. Itis implied by these studies that theformula would have the same effi-ciency if it were used for the selectionof "continuers" from all those ap-plying for therapy. Unless the for-mula is tested on a random sampleof applicants who are allowed to entertherapy without regard to their testscores, its efficiency for selection pur-poses is unknown. The reportedefficiency of the prediction formulain the above studies pertains only toits use in a population of patientswho have already been selected fortherapy. There is little likelihoodthat the formula can be used in anypractical way for further selection ofpatients unless the clinic's therapistsare carrying a far greater load thanthey plan to carry in the future.

The use of the term "selection"(as contrasted with "prediction" or"placement") ought not to blind usto the important differences betweenindustrial selection and its clinicalanalogue. The incidence of falsenegatives—of potential employeesscreened out by the test who wouldactually have made good on the jobif hired—is of little concern to man-agement except as it costs money togive tests. Hence the industrialpsychologist may choose to expresshis aim in terms of minimizing thefalse positives, i.e., of seeing to itthat the job success among those hiredis as large a rate as possible. Whenwe make a clinical decision to treator not to treat, we are withholdingsomething from people who have aclaim upon us in a sense that is muchstronger than the "right to work"gives a job applicant any claim upona particular company. So, eventhough we speak of a "selection ratio"in clinical work, it must be remem-

bered that those cases not selected arepatients about whom a certain kindof important negative decision isbeing made.

For any given selection ratio, maxi-mizing total hits is always equivalentto maximizing the hit rate for eithertype of decision (or minimizing theerrors of either, or both, kinds), sincecases shifted from one cell of the tablehave to be exactly compensated for.If m "good" cases that were cor-rectly classified by one decisionmethod are incorrectly classified byanother, maintenance of the selectionratio entails that m cases correctlycalled "poor" are also miscalled"good" by the new method. Hencean externally imposed selection ra-tio eliminates the often troublesomevalue questions about the relativeseriousness of the two kinds of errors,since they are unavoidably increasedor decreased at exactly the same rate.

If the test yields a score or a con-tinuously varying index of somekind, the values of pi and pi are notfixed, as they may be with "patterns"or "signs." Changes in the selectionratio, R, will then suggest shiftingthe cutting scores or regions on thebasis of the relations obtaining amongR, P, and the p\, pi combinationsyielded by various cuts. It is worthspecial comment that, in the case ofcontinuous distributions, the opti-mum procedure is not always tomove the cut until the total areatruncated =NR, selecting all abovethat cut and rejecting all those below.Whether this "obvious" rule is wiseor not depends upon the distributioncharacteristics. We have found iteasy to construct pairs of distribu-tions such that the test is "discrimi-nating" throughout, in the sense thatthe associated cumulative frequencies31 and q-t maintain the same directionof their inequality everywhere in therange

Page 18: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 211

/ i /••<I i.e., I fa(x)dx\ N* J-«,

>~^Jf,(x}dX for all

yet in which the hit frequency givenby a single cut at R is inferior to thatgiven by first selecting with a cutwhich yields NC<NR, and thenpicking up the remaining (NR — Ne)cases at random below the cut. Othermore complex situations may arisein which different types of decisionsshould be made in different regions,actually reversing the policy as wemove along the test continuum. Suchnumerical examples as we have con-structed utilize continuous, unimodaldistributions, and involve differencesin variability, skewness, and kurtosisnot greater than those which arisefairly often in clinical practice. Ofcourse the utilization of any verycomplicated pattern of regions re-quires more stable distribution fre-quencies than are obtainable from thesample sizes ordinarily available toclinicians.

It is instructive to contemplatesome of the moral and administra-tive issues involved in the practicalapplication of the preceding ideas. Itis our impression that a good deal ofclinical research is of the "So —what?" variety, not because of de-fects in experimental design such asinadequate cross validation but be-cause it is hard to see just what arethe useful changes in decision makingwhich could reasonably be expectedto follow. Suppose, for example, it isshown that "duration of psycho-therapy" is 70% predictable from acertain test. Are we prepared to pro-pose that those patients whose testscores fall in a certain range shouldnot receive treatment? If not, thenis it of any real advantage therapeuti-cally to "keep in mind" that the pa-

tient has 7 out of 10 chances of stay-ing longer than IS hours, and 3 outof 10 chances of staying less thanthat? We are not trying to pokefun at research, since presumablyalmost any lawful relationship standsa chance of being valuable to ourtotal scientific comprehension someday. But many clinical papers areostensibly inspired by practical aims,and can be given theoretical inter-pretation or fitted into any largerframework only with great difficultyif at all. It seems appropriate tourge that such "practical"-orientedinvestigations should be really prac-tical, enabling us to see how ourclinical decisions could rationally bemodified in the light of the findings.It is doubtful how much of currentwork could be justified in theseterms.

Regardless of whether the testvalidity is capable of improving onthe base rates, there are some pre-diction problems which have practicalimport? only because of limitations inpersonnel. What other justificationis there for the great emphasis inclinical research on "prognosis,""treatability," or "stayability"? Thevery formulation of the predictivetask as "maximizing the number ofhits" already presupposes that weintend not to treat some cases; sinceif we treat all comers, the ascertain-ment of a bad prognosis score hasno practical effect other than todiscourage the therapist (and thushinder therapy?). If intensive psy-chotherapy could be offered to allveterans who are willing to acceptreferral to a VA Mental HygieneClinic, would it be licit to refusethose who had the poorest outlook?Presumably not. It is interesting tocontrast the emphasis on prognosisin clinical psychology with that in,say, cancer surgery, where the treat-ment of choice may still have a very

Page 19: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

212 PAUL E. MEEHL AND ALBERT ROSEN

low probability of "success," but isnevertheless carried out on the basisof that low probability. Nor doesthis attitude seem unreasonable, sinceno patient would refuse the bestavailable treatment on the groundthat even it was only 10% effective.Suppose a therapist, in the courseof earning his living, spends 200hours a year on nonimprovers byfollowing a decision policy that alsoresults in his unexpected success withone 30-year-old "poor bet." If thisclient thereby gains 16X365X40= 233,600 hours averaging 50% lessanxiety during the rest of his naturallife, it was presumably worth theprice.

These considerations suggest that,with the expansion of professionalfacilities in the behavior field, theprediction problem will be less likethat of industrial selection and morelike that of placement. "To treat ornot to treat" or "How treatable"or "How long to treat" would bereplaced by "What kind of treat-ment?" But as soon as the problemis formulated in this way, the externalselection ratio is usually no longerimposed. Only if we are decidingbetween such alternatives as classicalanalysis and, say, 50-hour interpreta-tive therapy would such personnellimitations as can be expected infuture years impose an arbitrary R.But if the decision is between suchalternatives as short-term interpreta-tive therapy, Rogerian therapy,Thome's directive therapy, hypnoticretraining, and the method of tasks(10, 16, 19), we could "follow thebase rates" by treating every pa-tient with the method known tohave the highest success frequencyamong patients "similar" to him.The criteria of similarity (classmembership) will presumably bemultiple, both phenotypic and geno-typic, and will have been chosen be-

cause of their empirically demon-strated prognostic relevance ratherthan by guesswork, as is currentpractice. Such an idealized situationalso presupposes that the selectionand training of psychotherapists willhave become socially realistic so thattherapeutic personnel skilled in thevarious methods will be available insome reasonable proportion to theincidence with which each method isthe treatment of choice.

How close are we to the upperlimit of the predictive validity ofpersonality tests, such as was reachedremarkably early in the developmentof academic aptitude tests? If thenow-familiar f to f proportions ofhits against even-split criterion di-chotomies are already approachingthat upper limit, we may well dis-cover that for many decision prob-lems the search for tests that willsignificantly better the base rates is arather unrewarding enterprise. Whenthe criterion is a more circumscribedtrait or symptom ("depressed," "af-filiative," "sadistic," and the like),the difficulty of improving upon thebase rates is combined with thedoubtfulness about how valuable it isto have such information with 75%confidence anyhow. But this involveslarger issues beyond the scope of thepresent paper.

AVAILABILITY OF INFORMATIONON BASE RATES

The obvious difficulty we face inpractical utilization of the precedingformulas arises from the fact thatactual quantitative knowledge of thebase rates is usually lacking. But thisdifficulty must not lead to a dismissalof our considerations as clinically ir-relevant. In the case of many clini-cal decisions, chiefly those involvingsuch phenotypic criteria as overtsymptoms, formal diagnosis, sub-sequent hospitalization, persistence in

Page 20: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 213

therapy, vocational or marital ad-justment, and the numerous "sur-face" personality traits which clini-cians try to assess, the chief reasonfor our ignorance of the base rates isnothing more subtle than our failureto compute them. The file data avail-able in most installations having afairly stable source of clientele wouldyield values sufficiently accurate topermit minimum and maximum esti-mates which might be sufficient todecide for or against use of a pro-posed sign. It is our opinion that thisrather mundane taxonomic task is ofmuch greater importance than hasbeen realized, and we hope that thepresent paper will impel workers tomore systematic efforts along theselines.

Even in the case of more subtle,complex, and genotypic inferences,the situation is far from hopeless.Take the case of some such dynamicattribution as "strong latent de-pendency, which will be anxiety-arousing as therapy proceeds." Ifthis is so difficult to discern even dur-ing intensive therapy that a thera-pist's rating on it has too little reli-ability for use as a criterion, it is hardto see just what is the value of guess-ing it from psychometrics. If askilled therapist cannot discriminatethe personality characteristic afterconsiderable contact with the patient,it is at least debatable whether thecharacteristic makes any practicaldifference. On the other hand, if itcan be reliably judged by therapists,the determination of approximatebase rates again involves nothingmore complex than systematic re-cording of these judgments and sub-sequent tabulation. Finally, "clini-cal experience" and "common sense"must be invoked when there is noth-ing better to be had. Surely if theqi/qt ratio for a test sign claimingvalidity for "difficulty in accepting

inner drives" shows from the formulathat the base rate must not exceed.65 to justify use of the sign, we canbe fairly confident in discarding it foruse with any psychiatric population!Such a "backward" use of the for-mula to obtain a maximum usefulvalue of P, in conjunction with themost tolerant common-sense esti-mates of P from daily experience,will often suffice to answer the ques-tion. If one is really in completeignorance of the limits within whichP lies, then obviously no rationaljudgment as to the probable efficiencyof the sign can be made.

ESTIMATION VERSUSSIGNIFICANCE

A further implication of the fore-going thinking is that the exactnessof certain small sample statistics,or the relative freedom of certainnonparametric methods from dis-tribution assumptions, has to bestated with care lest it mislead clini-cians into an unjustified confidence.When an investigator concludes thata sign, item, cutting score, or patternhas "validity" on the basis of smallsample methods, he has rendered acertain very broad null hypothesisunplausible. To decide, however,whether this "validity" warrantsclinicians in using the test is (as everystatistician would insist) a furtherand more complex question. Toanswer this question, we require morethan knowledge that piT^pt. Weneed in addition to know, with re-spect to each decision for which thesign is being proposed, whether theappropriate inequality involving pi,pi, and P is fulfilled. More than this,since we will usually be extrapolatingto a somewhat different clinicalpopulation, we need to know whetheraltered base rates P' and Q' willfalsify these inequalities. To do thisdemands estimates of the test parame-

Page 21: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

214 PAUL E. MEEHL AND ALBERT ROSEN

ters pi and pz, the setting up of con-fidence belts for their differencepi — pz rather than the mere proof oftheir nonidentity. Finally, if the signis a cutting score, we will want toconsider shifting it so as to maintainoptimal hit frequency with new baserates. The effect upon p\ and p% of acontemplated movement of a criticalscore or band requires a knowledgeof distribution form such as only alarge sample can give.

As is true in all practical applica-tions of statistical inference, non-mathematical considerations enterinto the use of the numerical patternsthat exist among P, pit p2, and R.But "pragmatic" judgments initiallyrequire a separation of the severalprobabilities involved, some of whichmay be much more important thanothers in terms of the human valuesassociated with them. In some set-tings, over-all hit rate is all that wecare about. In others, a redistributionof the hits and misses even withoutmuch total improvement may con-cern us. In still others, the propor-tions pi and #2 are of primary interest;and, finally, in some instances theconfrontation of a certain incrementin the absolute frequency (NPpi) ofone group identified will outweighall other considerations.

Lest our conclusions seem undulypessimistic, what constructive sug-gestions can we offer? We havealready mentioned the following: (a)Searching for subpopulations withdifferent base rates; (b) successive-hurdles testing; (c) the fact that evena very small percentage of improve-ment may be worth achieving incertain crucial decisions; (d) the needfor systematic collection of base-rate data so that our several equa-tions can be applied. To these wemay add two further "constructive"comments. First, test research at-tention should be largely concen-

trated upon behaviors having baserates nearer a 50-50 split, since it isfor these that it is easiest to improveon a base-rate decision policy by useof a test having moderate validity.There are, after all, a large numberof clinically important traits whichdo not occur "almost always" or"very rarely." Test research mightbe slanted more toward them; thecurrent popularity of Q-sort ap-proaches should facilitate the growthof such an emphasis, by directingattention to items having a reason-able "spread" in the clinical popula-tion. Exceptions to such a researchpolicy will arise, in those rare do-mains where the pragmatic conse-quences of the alternative decisionsjustify focusing attention almostwholly on maximizing Ppi, withrelative neglect of Qp^. Secondly,we think the injunction "quit wast-ing time on noncontributory psy-chometrics" is really constructive.When the clinical psychologist seesthe near futility of predicting rare ornear-universal events and traits fromtest validities incapable of improvingupon the base rates, his clinical timeis freed for more economically de-fensible activities, such as researchwhich will improve the parameterspi and p$; and for treating patientsrather than uttering low-confidenceprophecies or truisms about them (inthis connection see 12, pp. vii, 7,127-128). It has not been our inten-tion to be dogmatic about "what isworth finding out, how often." Wedo suggest that the clinical use ofpatterns, cutting scores, and signs,or research efforts devoted to thediscovery of such, should always beevaluated in the light of the simplealgebraic fact discovered in 1763 byMr. Bayes.

SUMMARY

1. The practical value of a psy-

Page 22: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

ANTECEDENT PROBABILITY AND PSYCHOMETRIC SIGNS 215

chometric sign, pattern, or cuttingscore depends jointly upon its in-trinsic validity (in the usual sense ofits discriminating power) and thedistribution of the criterion variable(base rates) in the clinical population.Almost all contemporary researchreporting neglects the base-rate fac-tor and hence makes evaluation oftest usefulness difficult or impossible.

2. In some circumstances, notablywhen the base rates of the criterionclassification deviate greatly from a50 per cent split, use of a test signhaving slight or moderate validitywill result in an increase of erroneousclinical decisions.

3. Even if the test's parameters areprecisely known, so that ordinarycross-validation shrinkage is not aproblem, application of a sign withina population having these same testparameters but a different base ratemay result in a marked change inthe proportion of correct decisions.For this reason validation studiesshould present trustworthy informa-tion respecting the criterion distribu-tion in addition to such test param-eters as false positive and false nega-tive rates.

4. Establishment of "validity" byexact small sample statistics, sinceit does not yield accurate information

about the test parameters (a problemof estimation rather than signifi-cance), does not permit trustworthyjudgments as to test usefulness in anew population with different orunknown base rates.

5. Formulas are presented for de-termining limits upon relationsamong (a) the base rates, (&) falsenegative rate, and (c) false positiverate which must obtain if use of thetest sign is to improve clinical de-cision making.

6. If, however, external constraints(e.g., available staff time) render itadministratively unfeasible to decideall cases in accordance with the baserates, a test sign may be worth ap-plying even if following the baserates would maximize the total cor-rect decisions, were such a policypossible.

7. Trustworthy information as tothe base rates of various patient char-acteristics can readily be obtained byfile research, and test developmentshould (other things being equal) beconcentrated on those characteristicshaving base rates nearer .50 ratherthan close to .00 or 1.00.

8. The basic rationale is that ofBayes' Theorem concerning the cal-culation of so-called "inverse prob-ability."

REFERENCES1. AMERICAN PSYCHOLOGICAL ASSOCIATION,

AMERICAN EDUCATIONAL RESEARCHASSOCIATION, AND NATIONAL COUNCILON MEASUREMENTS USED IN EDUCA-TION, JOINT COMMITTEE. Technicalrecommendations for psychological testsand diagnostic techniques. Psychol.Bull., 19S4, SI, 201-238.

2. ANASTASI, ANNE, & FOLEY, J. P. Differ-ential psychology. (Rev. Ed.) NewYork: Macmillan, 1949.

3. BROSS, I. D. J. Design for decision. NewYork: Macmillan, 1953.

4. DANIELSON, J. R., & CLARK, J. H. Apersonality inventory for inductionscreening. /. din. Psychol., 1954, 10,137-143.

5. DORKEN, H., & Krai, A. The psychologi-cal differentiation of organic brainlesions and their localization by meansof the Rorschach test. Amer. J.Psychiat., 1952, 108, 764-770.

6. DUNCAN, 0. D., OHLIN, L. E., REISS, A.J., & STANTON, H. R. Formal devicesfor making selection decisions. Amer.J. Social, 1953, 58, 573-584.

7. GLUECK, S., & GLUECK, ELEANOR.Unraveling juvenile delinquency. Cam-bridge, Mass.: Harvard Univer. Press,1950.

8. GOODMAN, L. A. The use and validity ofa prediction instrument. I. A reformu-lation of the use of a prediction instru-

Page 23: ANTECEDENT PROBABILITY AND THE EFFICIENCY OF …apsychoserver.psych.arizona.edu/JJBAReprints/PSYC... · ANTECEDENT PROBABILITY AND THE EFFICIENCY OF PSYCHOMETRIC SIGNS, PATTERNS,

216 PAUL E. MEEHL AND ALBERT ROSEN

ment. Amer. J. Social., 1953, 58,S03-S09.

9. HANVIK, L. J. Some psychological dimen-sions of low back pain. Unpublisheddoctor's thesis, Univer. of Minnesota,1949.

10. HERZBERG, A. Active psychotherapy.New York: Grune & Stratton, 1945.

11. HORST, P. (Ed.) The prediction of per-sonal adjustment. Soc. Sci. Res. Coun.Bull, 1941, No. 48, 1-156.

12. MEEHL, P. E. Clinical versus statisticalprediction. Minneapolis: Univer. ofMinnesota Press, 1954.

13. REISS, A. J. Unraveling juvenile de-linquency. II. An appraisal of the re-search methods. Amer. J. Social., 1951,57, 115-120.

14. ROSEN, A. Detection of suicidal patients:an example of some limitations in theprediction of infrequent events. /.consult. Psychol., 1954, 18, 397-403.

15. ROTTER, J. B., RAFFERTY, J. E., & LOT-SOF, A. B. The validity of the RotterIncomplete Sentences Blank: highschool form. /. consult. Psychol., 1954,18, 105-111.

16. SALTER, A. Conditioned reflex therapy.New York: Creative Age Press, 1950.

17. TAULBEE, E. S., & SISSON, B. D. Ror-schach pattern analysis in schizo-phrenia: a cross-validation study. /.din. Psychol., 1954, 10, 80-82.

18. THIESEN, J. W. A pattern analysis ofstructural characteristics of the Ror-schach test in schizophrenia. /. consult.Psychol, 1952, 16, 365-370.

19. WOLFE, J. Objective psychotherapy ofthe neuroses. 5. African Med. J.,1952, 26, 825-829.

Received for early publication December 24,1954.