
Statistical Power in Observer-Performance Studies: Comparison of the Receiver Operating Characteristic and Free-Response Methods in Tasks Involving Localization¹

Dev Chakraborty, PhD

Rationale and Objectives. Statistical power, defined as the probability of detecting real differences between imaging modalities, determines the cost in terms of readers and cases of conducting receiver operating characteristic (ROC) studies. Neglect of location information in lesion-detection studies analyzed with the ROC method can compromise power. Use of the alternative free-response ROC (AFROC) method, which considers location information, has been discouraged, because it neglects intraimage correlations. The relative statistical powers of the two methods, however, have not been tested. The purpose of this study was to compare the statistical power of ROC and AFROC methods using simulations.

Materials and Methods. A new model including intraimage correlations was developed to describe the decision variable sampling and to simulate data for ROC and AFROC analyses. Five readers and 200 cases (half of which contained one signal) were simulated for each trial. Two hundred trials, equally split between the null hypothesis and alternative hypothesis, were run. Ratings were analyzed with the Dorfman-Berbaum-Metz method, and the separation of the null hypothesis and alternative hypothesis distributions was calculated.

Results. The AFROC method yielded higher power than the ROC method. Separation of the null hypothesis and alternative hypothesis distributions was larger by a factor of 1.6, regardless of the presence or absence of intraimage correlations. The effect of the incorrect localizations during ROC analysis of localization data is believed to be the major reason for the enhanced power of the AFROC method.

Conclusion. The AFROC method can yield higher power than the ROC method for studies involving lesion localization. Greater consideration of this methodology is warranted.

Key Words. Alternative free-response receiver operating characteristic (AFROC); free-response; lesion localization; modality evaluation; observer performance; receiver operating characteristic (ROC); statistical power.

© AUR, 2002

Receiver operating characteristic (ROC) methodology (1,2) is widely used to measure the diagnostic performance of radiologists and to compare different imaging modalities. In a ROC experiment, the reader assigns numeric ratings to each image, thereby quantifying the subjective degree of certainty (ie, confidence level) that the image is abnormal. The ratings data for normal and abnormal cases define points on the ROC curve, which is the plot of the true-positive (TP) fraction versus the false-positive (FP) fraction. The area under the ROC curve, denoted by Az, is commonly used as a metric of performance and reflects both the ability of the reader and the image quality of the modality.

Critical choices in the design of a ROC study that compares two imaging modalities are the numbers of readers and cases needed. These are determined with a power analysis by using standard hypothesis-testing procedures.

Acad Radiol 2002; 9:147–156

¹ From the Department of Radiology, University of Pennsylvania, 3600 Market St, Science Center, Suite 370, Room 115, Philadelphia, PA 19104. Received September 4, 2001; revision requested October 17; revision received and accepted October 21. Supported by the National Cancer Institute, National Institutes of Health, grant R01-CA75145-03. Address correspondence to the author.


One starts by specifying the null hypothesis that the two modalities are equal in performance and sets a significance level (P value) for the test, such as P = .05. This specifies the probability of incorrectly rejecting the null hypothesis (ie, the probability of a type I error, commonly denoted by α) as 5%. Next, one specifies the minimum difference (Δ) in the Az values that is clinically significant and, therefore, that should be detectable with reasonable certainty. Statistical power is the probability of correctly rejecting the null hypothesis (the complement of power is the probability of a type II error, commonly denoted by β). Too little power could lead to incorrect acceptance of the null hypothesis: the experimenter could conclude that the two modalities were not significantly different when, in fact, the experimental design was at fault.
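To make this recipe concrete, the sketch below estimates power by Monte Carlo for a two-sided z test on an observed difference in Az; the effect size (delta_az) and its standard error (sigma) are illustrative assumptions, not values from this study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(delta_az=0.05, sigma=0.02, n_trials=2000, alpha=0.05):
    """Toy power estimate: each trial draws a noisy observed Az difference
    and tests it against zero with a two-sided z test (sigma assumed known)."""
    observed = rng.normal(delta_az, sigma, n_trials)   # observed Az differences
    p_values = 2 * stats.norm.sf(np.abs(observed) / sigma)
    return np.mean(p_values < alpha)                   # fraction of correct rejections

print(f"estimated power: {simulated_power():.2f}")    # ~0.70 for these assumptions
```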

Power is determined by the inherent statistical variability, the appropriateness of the statistical methodology, and the amount of available information that can be included in the analysis. Power can be increased by (a) increasing the sample size, such as the number of cases and/or readers, because this averages out the case-sampling and reader variabilities; (b) increasing the sophistication of the statistical method; and (c) incorporating additional information into the analysis. Much work has been done in the second area, starting with the earliest applications, in which Az was determined by crude graphic techniques that could produce much uncertainty. These techniques were eventually replaced by maximum likelihood estimation of Az (3) and, subsequently, by a method that takes advantage of the correlations that exist between ratings when the same cases are interpreted by a reader (4). Currently, the Dorfman-Berbaum-Metz (DBM) procedure (5) allows the analysis of a ROC experiment in which a group of readers interprets the same set of images obtained with two or more modalities.

The third factor mentioned earlier is whether additional information can be incorporated into the analysis of the observer-performance experiment. The ROC method is strictly applicable to binary decision tasks, as exemplified by the task of deciding if the patient's condition is abnormal or normal. If, as is often the case, the clinical task does not fit the binary model, the lack of a more realistic model can lead an investigator to neglect some of the data, thereby compromising the statistical power of the analysis. An example of a nonbinary clinical task is lesion detection during chest radiography. The chest radiologist's report usually includes location information (eg, a suspicious lesion is present in the left apex), in addition to the binary diagnosis (ie, normal/abnormal). This report cannot be cast into a binary variable, and in the analyses of ROC studies involving localized lesions, the usual approach has been to ignore the location information. A well-known scoring ambiguity (6–8) can then result: the reader misses a true lesion (ie, a false-negative event) and mistakes a noise location on that image for a lesion (ie, an FP event), yet the ROC score credits the reader with a TP event. Note that whereas the clinical consequences of these mistakes can be serious, the ROC method does not penalize the imaging modality or the reader, who is susceptible to such errors.
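A small numeric example, with hypothetical ratings, makes the ambiguity explicit:

```python
# Hypothetical abnormal image with one lesion: the reader missed the lesion
# (no mark, rating 0) but rated a noise location at confidence level 85.
lesion_rating, noise_rating = 0, 85

# Conventional ROC scoring keeps only the highest rating on the image, so the
# image is credited as a confident "abnormal" call despite the missed lesion.
roc_rating = max(lesion_rating, noise_rating)   # 85: counted as a TP event

# Location-aware scoring would instead record both mistakes.
events = {"false_negative": lesion_rating == 0,
          "false_positive": noise_rating > 0}
print(roc_rating, events)   # 85 {'false_negative': True, 'false_positive': True}
```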

The possibility of compromised statistical power has led some investigators to examine variants of the ROC experiment in which location information is considered. Two such variants are the free-response ROC (FROC) and the location ROC (LROC) experiments (6,9). In LROC experiments, each image contains either no lesion (ie, normal) or one lesion (ie, abnormal). The reader makes a binary decision and indicates the most suspicious location on each image. In a FROC experiment, the reader rates all locations on an image perceived as being suspicious. That is, each image could yield zero or more ratings corresponding to these locations, and each image could contain zero (ie, normal) or more lesions (ie, abnormal). Localizations within a clinically acceptable distance from a true lesion are scored as TP events, and all other events are scored as FP events. Chakraborty and Winter (10) described a FROC model and a maximum-likelihood procedure for estimating its parameters. This procedure involved use of a computer program called FROCFIT (available from the author). In a subsequent study (11), a simpler and more justifiable procedure known as alternative free-response receiver operating characteristic (AFROC) analysis was proposed, which used existing ROC software (eg, ROCFIT, available from http://www-radiology.uchicago.edu/krl/toppage11.htm) to analyze FROC data after it had been appropriately rescored. Both methods described earlier (ie, FROCFIT and AFROC) assumed independence between TP and FP events on the same image; in addition, FROCFIT analysis operated under the assumption that the FP counts on an image had a Poisson distribution.
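The proximity rule described above (marks within a clinically acceptable distance of a true lesion are TP events; all others are FP events) can be sketched as follows; the coordinates and acceptance radius are hypothetical.

```python
import math

def score_froc_marks(marks, lesions, radius=20.0):
    """Classify each mark (x, y, rating) as TP or FP according to whether it
    lies within 'radius' (the clinically acceptable distance) of any lesion."""
    scored = []
    for x, y, rating in marks:
        hit = any(math.hypot(x - lx, y - ly) <= radius for lx, ly in lesions)
        scored.append(("TP" if hit else "FP", rating))
    return scored

print(score_froc_marks(marks=[(105, 98, 90), (300, 40, 25)],
                       lesions=[(100, 100)]))     # [('TP', 90), ('FP', 25)]
```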

This study focused on the AFROC method of analyzing FROC data. Because events on the same image originate in the same patient, they could be correlated. Neglect of these correlations in AFROC analysis has been questioned (12), however, which has discouraged its use. To date, most investigators have used ROC methods, even when their experiments have involved localization. To my knowledge, the possibility that AFROC methodology may yet yield higher power than ROC methodology in modality-comparison studies, in which differences in performance as opposed to absolute performances are measured, has not been considered. (As an aside, modality-comparison studies form the vast majority of applications of ROC methodology.) As a plausibility argument, AFROC could yield more power both because the effect of violating the independence assumption might cancel out of the modality differences and because neglect of the location information in ROC analysis could prove more deleterious to power than neglect of correlations in AFROC analysis.

Thus, the purpose of this study was to compare the statistical power of ROC and AFROC methods using simulations. The hypothesis was that, assuming both methods are applied to the same sets of images and readers, the AFROC method yields more power than the ROC method. Because of practical constraints, one must use mathematic simulations to address this hypothesis. For these simulations to be meaningful, a sampling model of the decision variable (DV) that reflected intraimage correlations was needed. To this end, a new model, called extended FROC (XFROC), was developed that incorporates correlations. This model is described herein, along with its application to answering the question regarding statistical power of the AFROC versus the ROC method.

MATERIALS AND METHODS

Consider a lesion detection task in which each abnormal image contains exactly one lesion. The following two simulation experiments formed the subjects of this work: a ROC experiment in which each observer rated the images on a scale of 0–100, and a FROC experiment conducted with the same images and the same readers. In the FROC experiment, the readers marked suspicious locations on each image and assigned a rating of 0–100 to each mark. The ROC study was analyzed with the DBM method by using software called LABMRMC (available at http://www-radiology.uchicago.edu/krl/toppage11.htm), and details of the procedure are available in the literature (5). The FROC experiment was scored in the AFROC manner, as described later, and the resulting ROC-like ratings were analyzed with LABMRMC. It should be noted that just as ROCFIT or LABROC software can be used to analyze single-reader, single-modality AFROC data, multiple-reader, multiple-case AFROC data can be analyzed with LABMRMC.

AFROC Scoring of Observer Ratings

Because of the assumptions regarding independence, ratings of the same image in AFROC analysis were regarded as independent. In human-observer AFROC experiments, the default zero (ie, 0) rating is reserved for any unmarked location. Because the simulation model generated ratings for all sites, this step was unnecessary in the present study. Rather than model the multiple FPs on an image, as is done during FROC analysis, AFROC analysis uses the FP event receiving the highest confidence level, which is termed the highest-noise (HN) event. This more descriptive term is used instead of the older (10) term FP image. An image with no explicit FPs (eg, an unmarked image) is scored as an HN event at the zero level. Thus, every abnormal image with one lesion will yield two ratings: one corresponding to the lesion and one corresponding to the HN event. For example, in AFROC scoring, if the lesion was localized at confidence level 90 and no other explicit events occurred, a TP (90) event and an HN (0) event are recorded. If the reader committed an additional FP event at confidence level 10, then a TP (90) event and an HN (10) event are recorded. If the reader marks another FP event at confidence level 25, then a TP (90) event and an HN (25) event are recorded; because 25 is greater than 10, the FP event at rating 10 no longer qualifies as the HN event. During AFROC analysis, the presence of lesions also is assumed not to affect the HN rating of any given image. Both independence assumptions can be violated, however, especially when the ratio of lesion size to image size is not small. The model described next can accommodate such effects.
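Restated as code, the rescoring rule of this paragraph reduces each image's marks to its TP ratings plus a single HN rating (a sketch using the scored-marks representation from the previous example):

```python
def afroc_rescore(scored_marks):
    """AFROC rescoring for one image: keep every TP rating and replace all FP
    ratings by the single highest-noise (HN) rating, defaulting to 0 for an
    image with no explicit FP marks."""
    tps = [r for label, r in scored_marks if label == "TP"]
    fps = [r for label, r in scored_marks if label == "FP"]
    hn = max(fps) if fps else 0      # unmarked noise scores at the zero level
    return tps, hn

print(afroc_rescore([("TP", 90), ("FP", 10), ("FP", 25)]))   # ([90], 25)
```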

The XFROC Model

The DV is defined as the degree of suspicion for abnormality, or the confidence level, that is associated with a given decision. To accommodate multiple signals and responses per image in a FROC experiment, one must use a multivalued DV, with different values corresponding to different locations on the same image. An image is considered to be a collection of mutually exclusive sites. Each site can contain at most one signal (ie, lesion), in which case it is a signal site, or no signal, in which case it is a noise site (ie, normal anatomic background). A noise (or signal) event refers to the occurrence of a specific value of the DV at a noise (or signal) site.

For the noise regions, it is useful to think of the multivalued DVs as being the sum of two terms. The first term, the case sample, is simply the average of all the noise values occurring at the different locations on the image. It is location independent, and it depends only on the particular case (ie, patient). The second term describes the location-to-location variability of the noise DVs on any given image; these are the location samples. The same decomposition applies to the signal regions. For example, a mammogram of a woman with a dense, thick breast is inherently more difficult to obtain (eg, higher kVp, more scatter) and to use in establishing a diagnosis than that of a woman with fatty, thinner breasts. This type of interpatient variability reflects the case-sampling component. Furthermore, certain regions on a mammogram of any given patient are more difficult to depict than other regions, such as glandular versus fatty regions. This type of intrapatient variability reflects the location-sampling component. During tasks involving localization, noise sites occur in both true-normal and true-abnormal images; that is, all nonlesion sites can potentially generate FP events on any image. Therefore, an abnormal image will yield two case samples: one corresponding to the background regions and one corresponding to the signal regions. A normal image will yield one case sample, which corresponds to the background regions.

The Figure shows the XFROC model for an abnormal image. Shown at the top and labeled CASE is a correlated, bivariate, Gaussian distribution describing the case sampling. The left distribution describes the patient-to-patient variability in the difficulty level of the location-averaged background, or noise, DV, and the right distribution describes the corresponding variability of the signal DV. The curved arrow labeled ρ_SN at the top indicates the correlation between the two distributions. The correlation parameter ρ_SN has the subscript "SN" to emphasize that it corresponds to the correlation of the noise and signal sites. The left distribution is centered at zero (by convention), and the right distribution is centered at μ. The corresponding standard deviations are σ_CN and σ_CS, with the subscripts denoting case noise (CN) and case signal (CS), respectively. A particular abnormal case provides two DV outcomes, labeled α and β, from this distribution, which correspond to the average case-dependent background and lesion properties, respectively, of that image.

Figure. The XFROC model for an abnormal image. Shown at the top and labeled CASE is a correlated bivariate distribution describing the case sampling. A particular abnormal case provides two samples (α, β) from this distribution. Centered on the α and β values are two distributions labeled LOCATION, from which the DVs for the different locations on the image are sampled. Background locations are sampled from the left distribution, and lesion locations are sampled from the right distribution.



Centered on the α and β values in the Figure are two Gaussian distributions, labeled LOCATION, from which the DVs for the different locations on that image are sampled independently. Background (noise) locations are sampled from the left distribution, and lesion locations are sampled from the right distribution. The corresponding standard deviations are σ_LN and σ_LS, with the subscripts denoting location noise (LN) and location signal (LS), respectively. Two such samples, δ and ε, are shown in the Figure, corresponding to a noise and a signal site, respectively. The net DVs x and y, corresponding to the noise region and the signal region, are given by x = α + δ and y = β + ε, respectively. The total number of location samples will equal the total number of independent sites that the image can accommodate, which is denoted by T. For a normal image, only the two left distributions apply. The measurement scale for the DV axis can be set by choosing σ_LN = 1. Therefore, the XFROC model is described by five parameters: μ, σ_CS, σ_CN, ρ_SN, and σ_LS. Note that setting σ_CS = σ_CN = 0 causes the XFROC model to revert to the classic binormal ROC model.

Appendix A shows the mathematic details of this model. Appendix B shows how the general XFROC model can simulate AFROC ratings data. The XFROC model describes intraimage correlations, and Appendix C describes how interimage correlations were included in the analysis. These are usually considered to be correlations arising from case and reader matching and are commonly described by the DBM variance-components model, which has been adapted to the needs of this study.
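A minimal sampler for the five-parameter model just described is sketched below (one lesion per abnormal image, as in the simulations; the variable names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_xfroc_abnormal(mu, sigma_cn, sigma_cs, rho_sn, sigma_ls,
                          sigma_ln=1.0, n_noise_sites=10):
    """Draw XFROC DVs for one abnormal image. Case step: a correlated
    bivariate Gaussian sample (alpha, beta). Location step: independent
    Gaussians centered on alpha (noise sites) and beta (the signal site)."""
    cov = [[sigma_cn**2, rho_sn * sigma_cn * sigma_cs],
           [rho_sn * sigma_cn * sigma_cs, sigma_cs**2]]
    alpha, beta = rng.multivariate_normal([0.0, mu], cov)
    noise_dvs = rng.normal(alpha, sigma_ln, size=n_noise_sites)  # x = alpha + location term
    signal_dv = rng.normal(beta, sigma_ls)                       # y = beta + location term
    return noise_dvs, signal_dv

noise, signal = sample_xfroc_abnormal(mu=1.5, sigma_cn=5.0, sigma_cs=5.0,
                                      rho_sn=0.5, sigma_ls=1.0)
```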

Converting AFROC Ratings to ROC Ratings

Because this study compared the ROC and AFROC methods by using the same set of images, it was necessary to convert the two AFROC ratings on an abnormal image to a single "equivalent" ROC rating. According to the method described by Swensson (9), the latter was assumed to be given by the higher of the two AFROC ratings. For example, if the two AFROC ratings were HN (50) and TP (70), the image received a ROC rating of 70, which would correspond to a correct localization event, because the DV of the signal event determined the overall DV of the abnormal image. For an image with the ratings HN (80) and TP (50), the assigned rating was 80, which would correspond to an incorrect localization event. During the ROC simulations, the number of times that the HN rating exceeded the signal rating was recorded, yielding the incorrect localization count. When divided by the number of abnormal images, this yielded an incorrect localization fraction for the experiment.
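In code form, the conversion and the incorrect-localization bookkeeping might look like this (continuing the earlier hypothetical representation):

```python
def afroc_to_roc(tp_rating, hn_rating):
    """Equivalent ROC rating for an abnormal image: the higher of the TP and HN
    ratings, plus a flag recording whether the localization was incorrect."""
    return max(tp_rating, hn_rating), hn_rating > tp_rating

pairs = [(70, 50), (50, 80)]                       # (TP, HN) ratings per image
converted = [afroc_to_roc(tp, hn) for tp, hn in pairs]
il_fraction = sum(il for _, il in converted) / len(pairs)
print(converted, il_fraction)                      # [(70, False), (80, True)] 0.5
```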

Simulations

Simulated null hypothesis and alternative hypothesis experiments were conducted in which data corresponding to a group of five readers viewing the same set of 200 images (half of which were normal) using two modalities were generated according to the XFROC model. For the ROC experiments, the data were converted to equivalent ROC ratings as described earlier. For the AFROC experiments, the data were regarded as arising from 200 normal images and 100 abnormal images, the latter each containing one lesion. In either case, the final ratings data were analyzed by using the LABMRMC program, which yielded the F statistic that, in turn, was used to quantify the significance of the modality effect. The procedure was repeated 100 times for each hypothesis. The F values for the null hypothesis and alternative hypothesis experiments were used to calculate a ROC-like area, because a ROC-like curve (ie, the power curve) results when 1 − Probability(type II error) is plotted versus Probability(type I error). Analogous to ROC analysis (13), the Wilcoxon statistic was used to estimate the area under the power curve. This was converted to a d′ index by using a standard formula (14). This index measures the normalized separation of the null hypothesis and alternative hypothesis distributions and is denoted by d′(ROC) or d′(AFROC) for the two respective methods of analysis. The d′ index can be regarded as the signal-to-noise ratio of the modality-comparison experiment. A larger separation implies higher power in the modality-comparison experiment. The object of the simulations was to determine which method, ROC or AFROC, yields the larger d′ measure. The simulations were restricted to σ_CS = σ_CN = σ_G, where σ_G denotes the common (global) standard deviation, and to σ_LS = σ_LN = 1. Two intraimage correlation models were considered: one uncorrelated, with σ_G = 0; and one correlated, with σ_G = 5 and ρ_SN = 0.5. The analysis was repeated for the two intraimage correlation models, three μ values (0.75, 1.5, and 2.5), three Δμ values (0.25, 0.30, and 0.35) quantifying the magnitudes of the modality differences, and eight variance structures (see Appendix C) describing different levels of reader and intercase correlations (15).
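The area-to-d′ step can be sketched as follows, assuming the usual conversion d′ = √2 · Φ⁻¹(A) is the "standard formula" cited (14); the F-value inputs below are purely illustrative stand-ins.

```python
import numpy as np
from scipy import stats

def dprime_from_f_samples(f_nh, f_ah):
    """Wilcoxon estimate of the area under the power curve (probability that
    an AH F value exceeds an NH F value, ties counted half), converted to d'."""
    f_nh, f_ah = np.asarray(f_nh), np.asarray(f_ah)
    greater = (f_ah[:, None] > f_nh[None, :]).mean()
    ties = (f_ah[:, None] == f_nh[None, :]).mean()
    area = greater + 0.5 * ties
    return np.sqrt(2.0) * stats.norm.ppf(area)

rng = np.random.default_rng(3)
f_nh = rng.f(1, 50, size=100)          # stand-ins for 100 NH F statistics
f_ah = rng.f(1, 50, size=100) + 1.0    # stand-ins for 100 AH F statistics
print(f"d' = {dprime_from_f_samples(f_nh, f_ah):.2f}")
```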


Additional simulations were performed to investigate the power issue further. The standard ROC model (15) was used to generate DVs, and ROC analysis with 100 abnormal and 100 normal images was compared with ROC analysis with 200 normal and 100 abnormal images. Additionally, ROC analysis was performed on XFROC-model simulations with 100 normal and 100 abnormal uncorrelated images, but with correct localization forced; that is, the signal-region rating was selected as the ROC rating regardless of the HN rating.

RESULTS

Table 1 shows the d′ results for uncorrelated and correlated cases, averaged over all variance structures, for the three values of μ (0.75, 1.5, and 2.5) and the three values of Δμ (0.25, 0.30, and 0.35) that were studied. In either correlation model, for a constant μ, both ROC and AFROC d′ values increased with Δμ. This was expected, because power increases with the size of the effect. Note also the substantially larger d′ values with AFROC compared with ROC. The ratio of the averaged d′ values was 1.65 for the uncorrelated cases and 1.59 for the correlated cases. The incorrect localization events occurring on abnormal images in ROC analysis were 27.6%, 13.4%, and 3.6% of the number of abnormal images for a μ of 0.75, 1.5, and 2.5, respectively, for uncorrelated cases when averaged over all other factors. These results were approximately 1% smaller for the correlated cases. The incorrect localization fractions were also observed to decrease as Δμ increased. Another observed trend was that as μ increased, the ratio (AFROC:ROC) of the averaged d′ values decreased. If the notion (discussed later) that the incorrect localization events are the main reason for these differences is accepted, then this trend is consistent with the observation that the incorrect localization events decreased as μ increased. In other words, as the task becomes easier, a noise location is more signal-like than a true signal on fewer occasions, and the differences between ROC and AFROC analyses tend to decrease.

The observed probability (α) of making type I errors and the observed power (1 − β) are shown in Table 2 for the different simulation conditions. Note that for both uncorrelated and correlated cases, the observed probabilities of making type I errors were similar for AFROC and ROC analysis, and both were close to the target P value of 5%. The larger observed power values for AFROC shown in Table 2 are consistent with the corresponding greater separations of the null hypothesis and alternative hypothesis distributions shown in Table 1.

As noted earlier, additional simulations were conducted to investigate why AFROC analysis gave higher power than ROC analysis. Table 3 summarizes results for these and for the preceding simulations.

Table 1
Results of Simulations for Uncorrelated Cases (σ_G = 0 and ρ_SN = 0) and for Finite Intraimage Correlations (σ_G = 5 and ρ_SN = 0.5)

                  Uncorrelated Cases         Correlated Cases
μ      Δμ       d′(ROC)   d′(AFROC)       d′(ROC)   d′(AFROC)
0.75   0.25      0.60       1.16            0.61       1.17
0.75   0.30      0.81       1.44            0.81       1.43
0.75   0.35      0.98       1.74            1.00       1.74
1.50   0.25      0.64       1.13            0.68       1.13
1.50   0.30      0.92       1.37            0.98       1.41
1.50   0.35      1.01       1.66            1.06       1.64
2.50   0.25      0.46       0.86            0.50       0.84
2.50   0.30      0.78       1.09            0.76       1.03
2.50   0.35      0.97       1.38            0.97       1.35
Average          0.80       1.31            0.82       1.30

Note.—The results have been averaged over all eight variance structures. The last row lists the grand averages.

Table 2
Type I Error Probability (α) and Power (1 − β) Observed in Simulations

             Uncorrelated              Correlated
           α         1 − β           α         1 − β
ROC       0.047      0.228          0.051      0.236
AFROC     0.049      0.374          0.051      0.380


Listed are the simulation numbers, the DV sampling model used (XFROC or ROC), whether intraimage correlations were included, the method of analysis (ROC or AFROC), the numbers of normal and abnormal images in the simulations, whether correct localization was forced, and the observed d′ value averaged over all values of μ, all values of Δμ, and all variance structures.

Several points emerge from the data presented in Table 3. First, the d′ values are relatively precise (sampling error, approximately 2%), because data from 14,400 simulations (three μ values, three Δμ values, eight variance structures, 100 null hypothesis trials, and 100 alternative hypothesis trials) have been averaged for each value. Second, intraimage correlations of the magnitude considered in this study had a negligible effect on d′ values, which were within the sampling error (compare simulations 1 and 2 or simulations 3 and 4). Third, simulation 5 (XFROC model, 100 + 100 cases, no correlations, ROC analysis, correct localization forced) and simulation 6 (ROC model, 100 + 100 cases) yielded essentially identical d′ values (approximately 1.065). This was expected, because with forced correct localizations and zero correlation, ROC analysis of XFROC-generated data is equivalent to ROC analysis of data from the ROC model. Fourth, inclusion of the incorrect localizations reduced d′(ROC) from 1.065 (average of simulations 5 and 6) to approximately 0.81 (simulations 3 and 4). And fifth, simulation 7 (ROC model, 200 + 100 cases) showed an improvement of only approximately 9% relative to simulation 6 (ROC model, 100 + 100 cases), which is much smaller than, for example, the 60% effect observed between simulations 1 and 3. This suggests that the "normal-case doubling effect" is insufficient to explain the observed difference between AFROC and ROC analyses. The effect of forcing correct localizations in ROC analysis (31%; compare simulations 3 and 5) is larger than that of doubling the number of normal cases (9%; compare simulations 6 and 7), which suggests that incorrect localizations are the major effect.

DISCUSSION

The larger number of noise responses available in AFROC analysis (200 AFROC vs 100 ROC noise responses for the case sizes simulated) could constitute one reason why the AFROC method yielded greater power than the ROC method. Another possible reason is the effect of the incorrect localizations in ROC analysis. Every incorrect localization event mistakenly assigns a noise-generated rating to a signal event. This tends to decrease the observed effect (Δμ) of a treatment that enhances signal DVs over noise DVs. Additionally, the incorrect localization events could contribute additional case-sampling noise. On the basis of the results shown in Table 3, the larger number of responses is believed to be a relatively minor effect. The incorrect localizations, which can account for as much as 27% of events on abnormal images at μ = 0.75, probably represent the major reason for the increased power of the AFROC method.

This study focused on the effect of changing only the μ parameter between the two modalities (the modality effect is specified completely by Δμ).

Table 3
Summary of All Simulations

Simulation   DV Model   Data Correlated   Method of Analysis   No. of Images   Correct Localization Forced*   d′
1            XFROC      No                AFROC                100 + 100       No                             1.31
2            XFROC      Yes               AFROC                100 + 100       No                             1.30
3            XFROC      No                ROC                  100 + 100       No                             0.80
4            XFROC      Yes               ROC                  100 + 100       No                             0.82
5            XFROC      No                ROC                  100 + 100       Yes                            1.05
6            ROC        No                ROC                  100 + 100       NA                             1.08
7            ROC        No                ROC                  100 + 200       NA                             1.18

Note.—Simulations 1, 2, 3, and 4 represent the results already reported in Table 1. Simulation 5 represents an analysis in which correct localizations were forced. Simulation 7 represents a ROC-model analysis with 200 normal and 100 abnormal images. The effect of forcing correct localizations in ROC analysis (compare simulations 3 and 5) is larger than that of doubling the number of normal cases (compare simulations 6 and 7), suggesting that incorrect localizations are the major effect.
*NA = not applicable.


It is possible for two modalities to differ in other parameters of the XFROC model, however, and more work is needed to ensure that the present conclusions remain valid in other situations. Additional validation is needed when realistic values of T are used, because the HN distribution may then not be Gaussian (T did not enter the present calculations, and the Gaussian assumption is valid when T is infinite). A minor weakness, in my opinion, is that the XFROC model does not permit arbitrary correlations between events on the same image. For example, the correlation between noise events could depend on their spatial separation, with larger correlations to be expected for closely spaced events. In the XFROC model, these correlations are modeled by a constant value that represents the average noise–noise correlation and, similarly, the average signal–noise and signal–signal correlations. The gains in modeling accuracy with more general correlations must be balanced against the increased model complexity and the correspondingly greater difficulty in solving the inverse problem. For arbitrary correlations, T samples per image would require a T-variate distribution with 2T + T(T − 1)/2 parameters, which increases as T² for large T. An image set with, on average, 10 possible noise sites per image, which does not appear to be unrealistic for lesion detection in chest radiography, would require more than 100 parameters; solving for these would be a daunting task.

Currently, no reliable method for solving the inverse problem of estimating model parameters from reader data is available. The model depicted in the Figure is a continuous, bivariate mixture distribution model. The expectation-maximization algorithm (16) is widely used by statisticians to solve such models, and adapting this algorithm to solve the XFROC estimation problem should be possible. Although the present results have shown that, for modality-comparison studies, AFROC analysis is resistant to violations of the independence assumptions, a solution of the XFROC estimation problem remains highly desirable and would have several benefits. First, a detectability measure could be developed that is independent of T and of the number of lesions per image, because these would be included in the model. Neither AFROC nor LROC analysis has this feature, which makes intercomparison and combination (meta-analysis) of different AFROC/LROC data sets difficult. Second, such a solution should provide even more power than AFROC analysis, because the AFROC method is based on assumptions that may be violated and replaces the multiple FP ratings occurring on an image with one HN rating, which does not use all the available information. Finally, T can be estimated from the data by regarding it as an unknown model parameter to be determined by the likelihood maximization. Because a statistical measure of performance is being calculated, it is not necessary that the individual images have identical T values, and a model that assumes a single T value is expected to be useful. The quantity T is expected to be observer dependent, and knowledge regarding its value could be important in training radiologists. For example, for a given image set, a radiologist with an unusually large T compared with that of his or her colleagues must be examining clinically extraneous areas of the image that generate only FP events, and knowledge of this could be used to improve his or her performance. These potential benefits are presently unrealized, but it is hoped that these speculative comments will stimulate other investigators to contribute to this field.

Swensson (9) has shown that use of the location response when fitting conventional ROC curves reduces the standard error of the conventional ROC curve fit. Although use of standard errors to evaluate a new analytic ROC method is not uncommon, that approach was not taken in the present study. I believe that by comparing standard errors alone, one cannot determine which method is better. It is more relevant to study power, as was done here, and to show the effect on the separation of the null hypothesis and alternative hypothesis distributions. As an extreme example, a "stuck" measuring instrument would always yield the same measurement and, hence, zero standard error, but it would have zero power.

In conclusion, the principal contribution of this work has been the development of the XFROC model, which allows realistic correlations to be incorporated with relatively few parameters. This model was then used to compare the statistical powers of the ROC and AFROC methods. The AFROC methodology was superior in tasks involving single-lesion localization by a factor of approximately 1.6, as measured by the net separation of the null hypothesis and alternative hypothesis distributions, regardless of whether correlations were included in the model. Allowing more than one lesion per image would tend to further increase the advantage of AFROC compared with ROC analysis. Observer-performance experiments are needed to confirm the simulations, and it would be interesting to apply the simulation methodology to LROC analysis and to a methodology recently proposed by Obuchowski et al (17), in which similar gains in statistical power are possible.


APPENDIX A

Mathematical Details of the XFROC Model

A random variable x sampled from the distribution D is denoted by x ~ D. For an abnormal image i (i = 1, 2, ..., I, where I is the total number of images), the DVs x_{i,j} and y_{i,j} corresponding to the jth noise and signal sites, respectively, on the ith image are given by

x_{i,j} ~ N(α_i, σ_LN)
y_{i,j} ~ N(β_i, σ_LS),   (A1)

where

(α_i, β_i) ~ N₂(0, μ, σ_CN, σ_CS, ρ_SN).   (A2)

Here, N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ, and N₂(μ₁, μ₂, σ₁, σ₂, ρ₁₂) denotes the bivariate normal distribution with parameters μ₁ and μ₂ corresponding to the means of the two random variables, σ₁ and σ₂ corresponding to their standard deviations, and ρ₁₂ to their correlation. The indices (i, j) are the case-sample and location-sample indices, respectively, and α_i and β_i are the noise and signal case samples, respectively, corresponding to the ith image. Because the position of the location distributions depends on the samples from the case distributions, the model is technically a mixture distribution model (16).

Equations (A1) and (A2) also describe how samples are generated when multiple noise/signal sites are involved. The DVs corresponding to the multiple noise/signal sites on an image i are obtained first by sampling the bivariate distribution. Then, the location distributions centered on (α_i, β_i) are sampled to obtain the different noise/signal location samples (x_{i,j}, y_{i,k}), where j = 1, 2, ..., T − S_i and k = 1, 2, ..., S_i. Here, T is the total number of sites, and S_i is the number of signals on the image.

Intraimage Correlations in the XFROC Model

Although the location DVs are sampled independently from the location distributions, the XFROC model accommodates correlations between different events on the same image. Denoting the correlations between the random-variable pairs xx, yy, and xy by ρ_xx, ρ_yy, and ρ_xy, respectively, it can be shown by using moment generating functions that

ρ_xx = σ_CN² / (σ_LN² + σ_CN²)
ρ_yy = σ_CS² / (σ_LS² + σ_CS²)
ρ_xy = ρ_SN σ_CN σ_CS / √[(σ_CN² + σ_LN²)(σ_CS² + σ_LS²)].   (A3)

As a subtle point, one might be tempted to consider Equation (A3) as defining the parameters of a bivariate distribution and to use that distribution, instead of the two-stage XFROC model shown in the Figure, as the sampling model. Such a model, however, would not be equivalent to the XFROC model. The sampling implied in calculating the correlations in Equation (A3) is over images. For example, ρ_xx is the correlation between DVs at a pair of noise locations on an image, but sampled over all images. When the sampling is over pairs of noise locations on different images, the correlation (averaged over all images) would be zero, which is implied by the XFROC model but not by Equation (A3).
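Equation (A3) is straightforward to check by simulation; the sketch below estimates ρ_xx empirically for an assumed parameter set and compares it with the closed form.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_cn, sigma_ln, n_images = 5.0, 1.0, 200_000

# One noise case sample per image, and two noise-location DVs on that image.
alpha = rng.normal(0.0, sigma_cn, n_images)
x1 = alpha + rng.normal(0.0, sigma_ln, n_images)
x2 = alpha + rng.normal(0.0, sigma_ln, n_images)

empirical = np.corrcoef(x1, x2)[0, 1]
predicted = sigma_cn**2 / (sigma_ln**2 + sigma_cn**2)   # rho_xx from Equation (A3)
print(f"empirical {empirical:.3f} vs predicted {predicted:.3f}")
```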

APPENDIX B: SIMULATING AFROC DATA BY USING THE XFROC MODEL

For AFROC analysis, one does not need to model individual FP events; rather, one needs to model the HN distribution. This can be done by identifying the lower-left distribution in the Figure as the location HN distribution. The net HN DV for an image i is given by

x_i ~ N(α_i, σ_LN).   (B1)

Note the absence of the j subscript in this equation, because there is only one HN event per image.
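For completeness, the corresponding draw, continuing the earlier sampler's conventions (one HN sample per image):

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_hn(alphas, sigma_ln=1.0):
    """One highest-noise DV per image, per Equation (B1): a single draw from
    the noise location distribution centered on each image's case sample."""
    return rng.normal(np.asarray(alphas, dtype=float), sigma_ln)  # no j index

hn_dvs = sample_hn([0.2, -1.3, 0.7])   # hypothetical case samples alpha_i
```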

APPENDIX C: INTERIMAGE CORRELATIONS OF THE RATINGS: THE DBM VARIANCE-COMPONENT MODEL

Because the same observers read the same images in the matched-reader, matched-case design, the ratings were correlated. To include reasonable values for the correlations, the DBM DV sampling model was modified. Following the notation used by Roe and Metz (15) and specializing to the case of no reading replications, the DBM model is described as


X_ijkt = μ_t + τ_it + R_jt + C_kt + (τR)_ijt + (τC)_ikt + (RC)_jkt + E_ijkt,   (C1)

where τ refers to the modality; R is the reader; C is the case; X_ijkt is the DV for modality i (i = 0 for modality 1, i = 1 for modality 2), reader j (j = 0, 1, 2, ..., number of readers − 1), case k (k = 0, 1, 2, ..., number of cases − 1), and truth state t (t = 0 for normal, t = 1 for abnormal); and μ_t is the DV for truth state t, which was zero for normal images and μ (μ = 0.75, 1.5, and 2.5 in the simulations) for abnormal images. The term τ_it, which is the effect of treatment i under truth state t, is often the primary quantity of interest, because it describes the modality effect. It was set τ_i0 = τ_01 = 0 and τ_11 = Δμ. The parameter Δμ specified the difference between the two modalities and was varied during the simulations (Δμ = 0.25, 0.30, and 0.35). R_jt is the effect of reader j under truth state t. C_kt is the effect of case k under truth state t on the DV. The term (τR)_ijt is the modality–reader interaction, (τC)_ikt is the modality–case interaction, (RC)_jkt is the reader–case interaction, and E_ijkt is a random error term. The variances of the random variables appearing on the right-hand side of Equation (C1) are denoted, in order, by σ_R², σ_C², σ_τR², σ_τC², σ_RC², and σ_E² and are collectively referred to as a variance structure. Values were chosen for these terms as specified by Roe and Metz (15), and the variance structures and μ values used in that work were used here.

Equation (C1) describes the DV sampling when there is one sample per image. In the XFROC case, this was true only for the normal images. For abnormal images, two samples were necessary per image, corresponding to the HN and signal events. To accommodate this difference, Equation (C1) was modified as follows: the truth index t was allowed three values instead of two (as in the work of Roe and Metz). The allowed values were 0 (normal image, HN sample), 1 (abnormal image, HN sample), and 2 (abnormal image, signal sample). The samples corresponding to t = 0 and either t = 1 or t = 2 were always uncorrelated, because they arose from different images. The samples corresponding to t = 1 and t = 2, however, could be correlated. The correlations could be intrinsic to the cases C_kt (as implicit in the XFROC model description) or introduced by any of the other terms bearing k and t subscripts in Equation (C1). For example, the (RC)_jkt term can introduce intraimage correlations if (RC)_jk1 and (RC)_jk2 are correlated. For simplicity, correlations in this study were allowed to enter only via the C_kt term.

Samples from N(0, σ_C) were used to generate C_k0 for the normal images, and samples from N₂(0, μ, σ_G, σ_G, ρ_SN) were used to generate the case components of C_k1 and C_k2 for the abnormal images. To these were added the appropriate location components, which were sampled from N(0, 1). Note that for t = 0, the sum of the case and location components can be regarded as a single Gaussian random variable, so two samples were unnecessary to simulate the normal images. The values of the XFROC parameters were adjusted to ensure that the net variances matched the target values presented by Roe and Metz (15). For example, the value of σ_G was scaled so that the net variance for the background or signal samples was σ_C². This involved the scaling σ_G → k·σ_G, where k = √(σ_C²/(σ_G² + 1)).

REFERENCES

1. Metz C. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720–733.

2. Metz C. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24:234–245.

3. Dorfman D, Alf E. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals: rating-method data. J Math Psychol 1969; 6:487–496.

4. Metz C, Wang PL, Kronman H. A new approach for testing the significance of differences between ROC curves measured from correlated data. In: Deconinck F, ed. Information processing in medical imaging. The Hague, the Netherlands: Nijhoff, 1984; 432–445.

5. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27:723–731.

6. Starr S, Metz C, Lusted L, Goodenough D. Visual detection and localization of radiographic images. Radiology 1975; 116:533–538.

7. Starr S, Metz C, Lusted L. Comments on generalization of receiver operating characteristic analysis to detection and localization tasks. Phys Med Biol 1977; 22:376–379.

8. Bunch P, Hamilton J, Sanderson G, Simmons A. A free-response approach to the measurement and characterization of radiographic-observer performance. J Appl Photogr Eng 1978; 4:166–171.

9. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996; 23:1709–1725.

10. Chakraborty D. Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data. Med Phys 1989; 16:561–568.

11. Chakraborty DP, Winter L. Free-response methodology: alternate analysis and a new observer-performance experiment. Radiology 1990; 174:873–881.

12. Metz CE. Evaluation of digital mammography by ROC analysis. In: Doi K, Giger M, Nishikawa R, Schmidt RA, eds. Digital mammography '96. Amsterdam, the Netherlands: Elsevier Science, 1996.

13. Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.

14. Burgess A. Comparison of receiver operating characteristic and forced choice observer performance measurement methods. Med Phys 1995; 22:643–655.

15. Roe C, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol 1997; 4:298–303.

16. McLachlan GJ, Krishnan T. The EM algorithm and extensions. New York, NY: Wiley, 1997.

17. Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000; 7:516–525.
