Impact of Data Presentation on Physician Performance Utilizing Artificial Intelligence-Based Computer-Aided Diagnosis and Decision Support Systems

L. Barinov 1,2,3 & A. Jairaj 1 & M. Becker 3,4 & S. Seymour 1 & E. Lee 3,4 & A. Schram 3,4 & E. Lane 4 & A. Goldszal 3,4 & D. Quigley 4 & L. Paster 3,4

Published online: 15 October 2018
© The Author(s) 2018

Journal of Digital Imaging (2019) 32:408–416
https://doi.org/10.1007/s10278-018-0132-5

* L. Barinov
[email protected]

1 Koios Medical, New York, NY, USA
2 Princeton University, Princeton, NJ, USA
3 Rutgers University Robert Wood Johnson Medical School, New Brunswick, NJ, USA
4 University Radiology Group, East Brunswick, NJ, USA

Abstract
Ultrasound (US) is a valuable imaging modality used to detect primary breast malignancy. However, radiologists have a limited ability to distinguish between benign and malignant lesions on US, leading to false-positive and false-negative results, which limit the positive predictive value of lesions sent for biopsy (PPV3) and specificity. A recent study demonstrated that incorporating an AI-based decision support (DS) system into US image analysis could help improve US diagnostic performance. While the DS system is promising, its efficacy in terms of its impact also needs to be measured when integrated into existing clinical workflows. The current study evaluates workflow schemas for DS integration and their impact on diagnostic accuracy. The impact of two different reading methodologies, sequential and independent, was assessed. This study demonstrates significant accuracy differences between the two workflow schemas as measured by area under the receiver operating characteristic curve (AUC), as well as inter-operator variability differences as measured by Kendall's tau-b. This evaluation has practical implications on the utilization of such technologies in diagnostic environments as compared to previous studies.

Keywords Breast cancer · Machine learning · Artificial intelligence · Clinical workflow · Computer-aided diagnosis (CAD) · Decision support

Introduction

Excluding skin cancer, breast cancer has the highest incidence rate and the second highest mortality rate in women [1]. Early and accurate diagnosis is a cornerstone strategy used to minimize breast malignancy, morbidity, and mortality. Imaging plays a central role in diagnosis; specifically, digital mammography/tomosynthesis and ultrasound are the most frequently used screening and diagnostic modalities. In current imaging protocols, ultrasound (US) is a valuable tool for evaluating breast tissue, achieving sensitivity comparable to digital mammography (DM) and improved detection of invasive and node-negative breast cancers [2]. This improvement, however, comes at the cost of lower PPV3 and specificity [3]. In practice, the increased false-positive rate manifests as an increase in benign biopsies.

One avenue being explored to address these concerns is the introduction of machine learning-based artificial intelligence (AI) systems. While these systems have been utilized in the past for mammography, their benefits have been recently called into question [4]. Additionally, rather than aiding in diagnosis, these systems have traditionally been used as an aid for the detection of suspicious areas [5–7]. This approach has been replicated in automated whole breast ultrasound (ABUS) but is only cleared to target areas not known to have suspicious findings [8]. More recently, tools have been developed to aid the diagnostic performance of radiologists, offering automated assessments of lesion characteristics and risk. Initial iterations have demonstrated a meaningful increase in sensitivity but a large decrease in specificity [4]. As machine learning techniques have progressed over the last 6 years, however, advances in performance within the diagnostic ultrasound space have followed suit.

A recent study demonstrated that integration of US with a new AI-based decision support (DS) system offers substantial improvement in both sensitivity and specificity. When tested


alone, the DS platform was shown to exceed radiologist performance in US data analysis, showing a 34–55% potential reduction in benign biopsies and an increase in the positive predictive value of lesions sent for biopsy (PPV3) of 7–20% [9].

While the system's raw performance is promising, DS's practical efficacy and impact also need to be assessed when integrated into existing real-world clinical workflows. This study investigates the clinical impact of two different diagnostic workflows. Clinical impact is evaluated as a function of how diagnostic support is presented. Presentation can either be sequential, where the clinician has an initial opportunity to evaluate the case unaided before receiving DS, or independent, where the case and decision support are presented together. Stand-alone clinician accuracy is compared to that of a clinician utilizing DS, and the system's impact on intra-operator and inter-operator variability is evaluated. The goal of this study is to evaluate workflow schemas for DS integration and their effects on diagnostic accuracy.

    Methods

    Data Collection

Using data acquired from the ACRIN 6666 trial [10], 500 cases were identified for inclusion. Lesion population statistics can be seen in Fig. 1. The dataset was enriched for malignancy, while all other statistics were chosen to approximate current population numbers per the Breast Cancer Research Foundation (BCRF) [10]. All pathological ground truth for malignant lesions came from biopsy-proven pathological follow-up, while for benign lesions, ground truth was established via biopsy or 1-year follow-up if the lesions were BI-RADS 4 and above or BI-RADS 3 and below, respectively. This dataset includes images obtained using a diverse set of US equipment and a range of high-frequency breast transducers (Fig. 2). Overall, this equipment and the lesions evaluated in the dataset accurately represent current clinical practice, including cases from both academic and non-academic sites as well as dedicated breast and non-dedicated imaging centers [10].

    Machine Learning

Utilizing original radiologist-designated regions of interest (ROI), two orthogonal views of each lesion were used to generate a machine learning system-based score [9]. These ROIs were selected by the radiologist who had originally read the case clinically. The score ranges from 0.0 to 1.0 and is subdivided into four categories, each representing a clinical course of action (Table 1).

These scores were presented to all study readers in a graphical form in the electronic Case Report Form (eCRF) (Fig. 3).
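As a rough illustration, the four-way categorization described above can be expressed as a simple threshold lookup. The bin edges come from Table 1; the function name and structure are hypothetical, not the system's actual implementation:

```python
# Illustrative sketch of the Table 1 score-to-category mapping.
# Bin edges [0, 0.25, 0.5, 0.75, 1.0] are taken from the paper;
# everything else here is a hypothetical stand-in.

def categorize(score: float) -> str:
    """Map a DS score in [0.0, 1.0] to one of four clinical categories."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0.0, 1.0]")
    if score < 0.25:
        return "Benign"
    if score < 0.5:
        return "Probably benign"
    if score < 0.75:
        return "Suspicious"
    return "Malignant"  # covers [0.75, 1.0]
```

Note the half-open intervals: a boundary score such as 0.25 falls into the higher category, matching the bracket notation in Table 1.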

Fig. 1 Lesion population statistics. a Tumor size. b Tumor grade. c Benign or malignant. d DCIS (non-invasive) vs invasive. e Lymph node status. f BI-RADS designation for the three radiologists tested


Reader Characteristics and Training

Three American Board of Radiology (ABR) certified radiologists, whose training and experience are summarized in Table 2, evaluated the 500-case dataset. Each radiologist was initially given a 30-min training session to understand the machine learning output, the stand-alone performance of the system, and the eCRF form. After training, each radiologist demonstrated proper utilization of the study platform by assessing 10 test cases proctored by the training staff.

Each reader specializes in breast imaging and performs a mix of screening and diagnostic breast imaging across multiple modalities.

    Reader Workflow

Cases were presented to and scored by radiologists in a single software environment that showed orthogonal images, ROIs, and the DS output. The study platform then queried the radiologist to input a Breast Imaging Reporting and Data System (BI-RADS) score (the current clinical system used to evaluate lesion suspicion) and likelihood of malignancy (LoM) as a percentage (Fig. 3).

Using this software, the readers reviewed ultrasound images using two separate workflows, which are summarized in Fig. 4:

Sequential Workflow

1. Readers reviewed the case without the DS output (control read)

Fig. 2 Ultrasound equipment characteristics. a Manufacturer. b Transducer frequency. "n/a" refers to cases in which US transducer frequency was not recorded

Table 1 Score ranges and their corresponding categorical outputs. These ranges and categories are inherent to the system and were not designed or altered for this study

Categorical output    Score range
Benign                [0, 0.25)
Probably benign       [0.25, 0.5)
Suspicious            [0.5, 0.75)
Malignant             [0.75, 1.0]


2. Readers scored the case via BI-RADS and LoM
3. DS output was presented to the readers
4. Readers re-scored the case via BI-RADS and LoM

    Independent Workflow

This workflow was employed after a 4-week "washout" period that followed the sequential workflow. After this washout period, readers were shown the images and the DS output, and they scored the case. In the independent workflow, readers were presented with 450 cases in randomized order. Fifty cases were withheld and reserved as control cases in order to measure intra-reader variability after the 4-week washout period. This workflow is summarized in the following sequence:

    Readers were presented with one of the following:

    1. Control case read with no DS output (50 Total)

(a) Readers scored the case via BI-RADS and LoM

OR

    2. A case containing DS output (450 Total)

    (b) Readers scored the case via BI-RADS and LoM

    System Evaluation

Radiologists were presented with DS output that used the original ROIs selected during clinical evaluation of the cases. This choice was made to enhance reader efficiency so that a high volume of cases could be evaluated in a practical time frame. However, while it is felt that the ROI a radiologist chooses manually would be very similar to the ROI used in this study, it is possible that results would be impacted by the variation caused by manually demarcated ROIs. To evaluate the system's robustness to variation in the ROI boundaries, two assessments were performed.

First, the DS output's robustness to ROI boundary variation was assessed by evaluating the 500 cases 20 times, randomly varying the ROI boundary each time. Specifically, each corner of the region was shifted at random by up to 20% in each dimension from the predetermined optimal cropping. ROC curves and the AUC distributions were calculated.
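A minimal sketch of this perturbation scheme, assuming a rectangular ROI given by two corner coordinates; the function name and coordinate convention are illustrative, not the authors' code:

```python
# Illustrative sketch of the ROI-perturbation test: each corner of a
# rectangular ROI (x0, y0, x1, y1) is shifted by a uniform random amount
# of up to 20% of the box's width/height, 20 times per case, mimicking
# variation in manual cropping.
import random

def jitter_roi(x0, y0, x1, y1, max_frac=0.20, rng=random):
    """Return a perturbed copy of the ROI, each corner coordinate shifted
    by up to max_frac of the box's extent in that dimension."""
    w, h = x1 - x0, y1 - y0
    shift = lambda extent: rng.uniform(-max_frac, max_frac) * extent
    return (x0 + shift(w), y0 + shift(h), x1 + shift(w), y1 + shift(h))

# As in the paper's robustness analysis: 20 perturbed ROIs per case
perturbed = [jitter_roi(10.0, 10.0, 110.0, 60.0) for _ in range(20)]
```

Each perturbed ROI would then be re-scored by the system so that the distribution of AUCs across the 20 runs can be examined.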

Second, the boundary between BI-RADS 3 Probably Benign (P) and BI-RADS 4a Suspicious (S) represents the most clinically impactful diagnostic decision point. It is

Fig. 3 Screen capture of the study platform. The left side shows two orthogonal views with ROIs. On the right side is the DS output and the radiologist case assessment input (BI-RADS assessment and likelihood of malignancy percentage)

Table 2 This table provides a summary of the three readers involved in this study

Radiologist ID   Post-educational training experience (years)   ABR certified   Breast fellowship training
1                20+                                            x               x
2                10+                                            x               x
3                5+                                             x


critical to understand the effects of ROI variability on class switching across categories, and specifically the P-S boundary. Since the system's output is categorical, changes across this decision boundary have the potential to change clinical management. The level of class switching due to ROI sensitivity was evaluated by utilizing the 20 ROIs generated in the previous analysis and counting the number of times category switching is observed compared to the initial radiologist-supplied cropping.
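The switching tally can be sketched as follows; the category labels, data layout, and function name are hypothetical placeholders, not the authors' code:

```python
# Illustrative sketch of the class-switching tally: for each case, the
# category from the original radiologist-supplied ROI is compared against
# the categories produced by the perturbed ROIs, and crossings over the
# Probably benign (P) / Suspicious (S) boundary are counted.

def switch_rate(baseline, perturbed, frm, to):
    """Fraction of perturbed reads landing in category `to`, among all
    perturbed reads whose baseline (unperturbed) category was `frm`."""
    hits = total = 0
    for base, cats in zip(baseline, perturbed):
        if base == frm:
            total += len(cats)
            hits += sum(1 for c in cats if c == to)
    return hits / total if total else 0.0

# Toy example: one P case whose four perturbed reads include one S crossing
rate = switch_rate(["P"], [["P", "S", "P", "P"]], frm="P", to="S")
```

Computing the rate in both directions (P to S and S to P) would yield the two percentages reported in the Results.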

Finally, in order to verify the relationship between the system's categorical output and BI-RADS, a sensitivity and specificity analysis was conducted. For each categorical group, the sensitivity and specificity values were compared between the system and each participating reader's BI-RADS assessment.
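A minimal sketch of the underlying sensitivity/specificity computation, assuming binary calls (e.g., Suspicious/Malignant treated as a positive) against biopsy-proven ground truth; the variable names are illustrative:

```python
# Illustrative sensitivity/specificity computation. A categorical call at
# or above the chosen operating point counts as positive; truth comes from
# the pathological ground truth described in Data Collection.

def sens_spec(calls, truth):
    """calls, truth: parallel lists of booleans (True = called / actually malignant)."""
    tp = sum(c and t for c, t in zip(calls, truth))
    tn = sum(not c and not t for c, t in zip(calls, truth))
    fp = sum(c and not t for c, t in zip(calls, truth))
    fn = sum(not c and t for c, t in zip(calls, truth))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Running this once per categorical cut-off traces out the operating points compared against the readers' BI-RADS assessments in Fig. 6.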

    Reader Evaluation

The area under the receiver operating characteristic curve (AUC) for each radiologist across each reading paradigm was calculated and compared to their control reads. The change in AUC and its 95% confidence interval were estimated with the Dorfman-Berbaum-Metz method of MRMC analysis, as implemented in the Metz-ROC LABMRMC software [11].

Intra- and inter-operator variability was assessed via Kendall's tau-b correlation coefficient [12]. This assessment was done in a pairwise fashion across each pair of readers, before and after being provided with the decision support output, and across each reading methodology.
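Assuming per-case ordinal scores (e.g., BI-RADS) for each reader, the pairwise computation might look like the following sketch using `scipy.stats.kendalltau`, which computes tau-b by default and therefore handles the heavy ties of ordinal scales; the reader identifiers and score lists are placeholders:

```python
# Illustrative pairwise inter-reader agreement via Kendall's tau-b.
from itertools import combinations
from scipy.stats import kendalltau

def pairwise_tau_b(reader_scores):
    """reader_scores: dict mapping reader id -> list of ordinal scores,
    one score per case, in the same case order for every reader."""
    results = {}
    for a, b in combinations(sorted(reader_scores), 2):
        tau, _p = kendalltau(reader_scores[a], reader_scores[b])  # tau-b by default
        results[(a, b)] = tau
    return results

# Toy example with three hypothetical readers over five cases
taus = pairwise_tau_b({
    "R1": [1, 2, 3, 4, 5],
    "R2": [1, 2, 3, 4, 5],
    "R3": [5, 4, 3, 2, 1],
})
```

Repeating this on scores given before and after DS exposure would show whether agreement tightens once all readers see the same support output.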

    Results

System Evaluation: Variability in the ROI boundary produces no significant change in either the shape of the ROC curve or the AUC values (Fig. 5a). Similarly, the ROI analysis shows minimal class switching between P-S/S-P categories, 2.7 and 3.3%, respectively (Fig. 5b). We further correlate the results for each of the categorical groups supplied by the system to the BI-RADS assessments provided by the radiologist readers (Fig. 6).

The operating points of the system appear to be as good as, or better than, those of the radiologists tested in this study (Fig. 6). This suggests that the categorical outputs of the system align with, and exceed, the performance of the BI-RADS assessments.

The reader evaluation analysis showed a system-only AUC [95% confidence interval (CI)] of 0.8648 [0.8345–0.8893]. Each radiologist's stand-alone performance is detailed in Table 3.

Similarly, the sequential and independent joint performance is summarized in Table 4 and Fig. 7.

Intra-reader and inter-reader variability, as measured by Kendall's tau-b, is summarized in Tables 5 and 6, respectively.

    Discussion

The current study confirms that this DS system performs favorably when compared with a radiologist's performance,

    Fig. 4 Schematic representation of the a sequential and b independent reading paradigms. A combination approach seen in c was utilized in this study


confirming prior studies [9]. In addition, rather than simply comparing DS system performance to that of a radiologist, it extends these prior results by assessing workflows that more realistically approximate clinical practice. Past studies have shown mixed evidence on the effect of the tested reading methodologies on overall reader performance, but none have conducted an investigation within the context of diagnostic decision support, and their effects remain unknown [13]. Our results show a sizeable variation in performance obtained depending on which reading methodology was chosen (Table 4). The impact of the reading methodology on study performance has practical workflow considerations. These results suggest there may be a strong impact of confirmation bias in a sequential study design. This can be clearly seen in Fig. 7b versus Fig. 7c, where the deviation from the control assessment is significantly smaller in the sequential read versus the independent read. This has practical implications for the utilization of machine learning decision support in breast ultrasound diagnostics and likely beyond.

Furthermore, as would be expected by providing a supplemental, concurrent read that outperforms the original reader, overall inter-operator variability decreased significantly. Surprisingly, inter-operator variability decreased even beyond that of the 4-week washout intra-reader variability per Tables 5 and 6. Due to the study design, the effects of a machine learning-based concurrent read on intra-operator variability were not evaluated, but with the evidence presented in this study, it would seem likely that a proportional decrease in this metric could be expected.

    Clinical Applications

When looking at practical clinical workflow applications, the performance results and study design have a number of implications for the application of AI software.

In clinical practice, the typical sequence of radiology workflow is:

Fig. 5 Results of the system evaluation. a ROC curves and corresponding AUCs assessing impact of ROI boundary variation. b Assessment of class switching due to ROI boundary variation

Fig. 6 Sensitivity and specificity of each reader's BI-RADS grading is compared to that of the system's corresponding categorical output


1. Radiologist assesses mammographic images and instructs the technologist to perform a complete mammographic examination
2. Radiologist decides if US is needed
3. If so, the technologist acquires images and presents them to the radiologist
4. Radiologist assesses the images and may or may not confirm the results with real-time scanning
5. Radiologist formulates a diagnosis

In the sequential workflow, the radiologist would complete "step 5" and then assess the DS output. In the independent workflow, the DS output would be presented to the radiologist during "step 4," along with the US images (e.g., along with the other US "data").

In clinical practice, if the radiologist has confidence in the DS system, the independent workflow seems more likely to impact clinical management, e.g., the radiologist looks at all the data (demographic/clinical history, mammographic, US, DS output) and forms a diagnosis (the ultimate goal).

    Comparison to Other Systems

Prior research has explored the difference between sequential and independent study designs and their respective effects on performance [13–16]. These studies have suggested that both designs produce similar performance results within comparative analyses. They then conclude that sequential designs are preferable due to lower time and resource requirements. Interestingly, our results seem to suggest a more significant deviation between these two study designs.

The difference between our results and the results discussed above can perhaps be attributed to the following factors. First, the technology being tested and the underlying modality are both different. Second, the task being performed by the decision support system is altogether different from that of the CAD systems examined in these studies. Most of the systems under study are focused on detection, while DS is focused on diagnosis.

In detection, a sequential read implies that the clinician identifies regions of interest, performs CAD, and then potentially examines additional regions, as suggested by the CAD device. When performing an independent read, that clinician will see regions of interest as identified by CAD and may then examine additional regions based upon their own inspection. In both cases, the clinician is combining their detection results with the CAD's, so it is reasonable that performance is similar between the two.

In diagnosis, a sequential read implies that the clinician will examine a nodule, arrive at a conclusion, and then receive the recommendation of the DS system. The recommendation will either confirm their decision, alter their decision, or not be considered. In an independent read, the recommendation is presented at the outset and is considered as additional input data alongside the image as the clinician makes a decision. In the sequential case, since the clinician has already formed an opinion, a dissenting recommendation may meet with intransigence and wind up suffering from confirmation bias.

    Limitations

While the results suggest a strong performance benefit, there are several limitations to the study design that must be taken into consideration. The study only consisted of three (3) readers. Although the readers had varying degrees of experience within the study, the study does not capture the full breadth of readers across the broader population of radiologists that read and interpret breast ultrasound images. The

Table 3 Each reader's performance was assessed prior to being presented the system's output. The results of their control reads as measured via AUC are shown in this table

Radiologist ID   AUC, 95% CI
1                0.7618 [0.7244–0.7934]
2                0.7543 [0.7197–0.7887]
3                0.7325 [0.6897–0.7689]

Table 4 In order to compare the two reading methodologies, the readers' performance was assessed via AUC compared to their control reads summarized in Table 3. None of the readers attained statistical significance when utilizing sequential reads, while all readers were significantly better when utilizing an independent read strategy

Radiologist ID   Sequential read AUC, 95% CI   P value CR vs SR (two-tailed, alpha = 0.05)   Independent read AUC, 95% CI   P value CR vs IR (two-tailed, alpha = 0.05)
1                0.7935 [0.7567–0.8229]        0.235                                         0.8213 [0.7861–0.8516]         0.0285*
2                0.7674 [0.7327–0.8001]        0.601                                         0.8305 [0.7982–0.8594]         0.00155*
3                0.7859 [0.7527–0.8174]        0.0532                                        0.7988 [0.7632–0.8310]         0.0160*

*Significant


number of cases (500) and enrichment within the study may also limit its ability to represent the typical distribution of cases that a reader would expect to see in practice. The choice of distribution of the cases was an attempt to create a set that was broadly representative of the greater population of lesions as a whole but could also answer questions in a statistically meaningful way about events that have low population incidence. This further extends to the retrospective nature of the study design. In clinical practice, additional factors impact clinical reasoning that are not fully represented in the study, such as patient history, prior studies, and patient preference towards clinical follow-up. While a prospective, randomized

control study would have addressed some of these concerns, it would come at the cost of study time and complexity.

This study did not compare intra-operator variability with and without a decision support output, as it would have required an additional set of reads for each participant. Based on the current results, it would seem likely that the intra-operator variability would decrease, but without study modification, the occurrence and extent of the variability are unknown.

The first reading session of this study was not counterbalanced: all readers completed the sequential session first and the independent session second. Cases without a corresponding DS output were randomly introduced in the independent session to break up the reading schedule and allow for the evaluation of intra-operator variability. The lack of counterbalancing between sequential and independent reads may have introduced slight reader bias when comparing the two paradigms.

Finally, the current study assesses the impact of DS on US interpretation in isolation, when in fact a complete breast imaging assessment incorporates demographic data, clinical history, and other imaging modalities such as mammography. New and exciting avenues of inquiry are needed to more fully evaluate the role and utility of DS in this larger context.

Fig. 7 Comparative assessment of a control, b sequential, and c independent reading workflows. Operating point-specific improvement for independent vs control assessments was additionally measured (d)

Table 5 To further characterize reader performance, intra-reader variability was measured via Kendall's tau-b

Radiologist ID   Kendall's tau-b for intra-reader variability assessment
1                0.597
2                0.595
3                0.529


Conclusion

We have been able to demonstrate that reader workflow can significantly affect clinical performance when incorporating AI-based decision support tools. This evaluation has novel practical implications for the utilization of such technologies in diagnostic environments as compared to previous studies, which have concluded an effective equivalence between these two reading paradigms. Independent (concurrent) reads have shown dramatic shifts in reader performance and inter-operator variability as compared to either control reads or sequential reads. The evidence provided in this study can be used to inform both study design when demonstrating the efficacy of new diagnostic decision support tools and their implementation in practical environments.

Acknowledgements We would like to thank Dr. Bruce McClennan, William Hulbert, and Graham Anderson for all of the discussion and insight provided to make this work possible.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

    References

    1. Siegel RL, Miller KD, Jemal A: Cancer statistics, 2018. CA: ACancer Journal for Clinicians 68:7–30, 2018. https://doi.org/10.3322/caac.21442

    2. Bae MS, HanW, Koo HR, Cho N, Chang JM, Yi A, Park I, Noh D,Choi WS, Moon WK: Characteristics of breast cancers detected byultrasound screening in women with negative mammograms.Cancer Science 102:1862–1867, 2011. https://doi.org/10.1111/j.1349-7006.2011.02034.x

    3. Berg, Wendie A., Bandos, Andriy I., Mendelson, Ellen B., Lehrer,Daniel, Jong, Roberta A., Pisano, Etta D.; Ultrasound as the prima-ry screening test for breast cancer: analysis from ACRIN 6666,JNCI: Journal of the National Cancer Institute, Volume 108, Issue4, 1 April 2016, djv367, https://doi.org/10.1093/jnci/djv367

    4. Chabi M, Borget I, Ardiles R, Aboud G, Boussouar S, Vilar VS,Dromain C, Balleyguier C: Evaluation of the accuracy of a computer-aided diagnosis (CAD) system in breast ultrasound according to theradiologist's experience. Academic radiology 19(3):311–319, 2012

    5. Gur, D., Sumkin, J.H., Rockette, H.E., Ganott, M., Hakim, C.,Hardesty, L., Poller, W.R., Shah, R., Wallace, L.; Changes in breast

    cancer detection and mammography recall rates after the introduc-tion of a computer-aided detection system, JNCI: Journal of theNational Cancer Institute, Volume 96, Issue 3, 4 February 2004,Pages 185–190, https://doi.org/10.1093/jnci/djh067.

    6. Litjens G, Debats O, Barentsz J, Karssemeijer N, Huisman H:Computer-aided detection of prostate cancer in MRI. in IEEETransactions on Medical Imaging 33(5):1083–1092, 2014. https://doi.org/10.1109/TMI.2014.2303821 http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6729091&isnumber=6803900

    7. Van Ginneken B, Ter Haar Romeny BM, Viergever MA:Computer-aided diagnosis in chest radiography: a survey. inIEEE Transactions on Medical Imaging 20(12):1228–1241, 2001.https://doi.org/10.1109/42.974918 http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=974918&isnumber=21028

    8. FDA.gov. QVIEW Medical, Inc. Premarket approval for QVCADsystem. Summary of safety and effectiveness. Approved:November 9, 2016. https://www.accessdata.fda.gov/cdrh_docs/pdf15/P150043B.pdf

    9. Barinov L, Jairaj A, Paster L, Hulbert W, Mammone R, PodilchukC: Decision quality support in diagnostic breast ultrasound throughartificial intelligence. IEEE Signal Processing in Medicine andBiology Symposium (SPMB)., 2016. https://doi.org/10.1109/spmb.2016.7846873

    10. Data were obtained from the NCI-funded Breast Cancer Surveillance Consortium (HHSN261201100031C). Downloaded on February 21, 2018, from the Breast Cancer Surveillance Consortium website, http://breastscreening.cancer.gov/.

    11. Dorfman DD, Berbaum KS, Metz CE: Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology 27:723–731, 1992

    12. Kendall M: A new measure of rank correlation. Biometrika 30(1–2):81–89, 1938. https://doi.org/10.1093/biomet/30.1-2.81. JSTOR 2332226

    13. Schalekamp S, van Ginneken B, Schaefer-Prokop CM, Karssemeijer N: Influence of study design in receiver operating characteristics studies: sequential versus independent reading. Journal of Medical Imaging 1(1):015501, 2014. https://doi.org/10.1117/1.JMI.1.1.015501

    14. Beiden SV, Wagner RF, Doi K, Nishikawa RM, Freedman M, Lo S-CB, Xu X-W: Independent versus sequential reading in ROC studies of computer-assist modalities. Academic Radiology 9(9):1036–1043, 2002

    15. Gallas BD, Chan H-P, D'Orsi CJ, Dodd LE, Giger ML, Gur D, Krupinski EA, Metz CE, Myers KJ, Obuchowski NA, Sahiner B, Toledano AY, Zuley ML: Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Academic Radiology 19(4):463–477, 2012. https://doi.org/10.1016/j.acra.2011.12.016

    16. Obuchowski NA, Meziane M, Dachman AH, Lieber ML, Mazzone PJ: What's the control in studies measuring the effect of computer-aided detection (CAD) on observer performance? Academic Radiology 17:761–767, 2010

    Table 6  Pairwise combinations of variability were measured utilizing the sequential read (SR) methodology. Interestingly, the independent read (IR) variability was lower than intra-reader variability

    Radiologist ID (KTB for reading methods: control, SR, IR)

        1            2                             3
    1   (1, 1, 1)    (0.5505, 0.6263*, 0.7231**)   (0.4944, 0.6640*, 0.7476**)
    2   –            (1, 1, 1)                     (0.4229, 0.5641*, 0.6231**)
    3   –            –                             (1, 1, 1)

    *Significant with p < .01; **significant difference with p < 1e−8
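    The inter-reader agreement values above are Kendall tau-b (KTB) coefficients, the tie-corrected rank correlation of reference 12. The paper does not publish its implementation; the following is a minimal pure-Python sketch of how a single pairwise KTB value could be computed from two readers' ordinal ratings.

    ```python
    from itertools import combinations
    from math import sqrt

    def kendall_tau_b(x, y):
        """Kendall tau-b rank correlation with tie correction.

        A pair of cases is concordant when both raters order them the
        same way and discordant when they disagree; ties contribute to
        neither count but shrink the denominator.
        """
        concordant = discordant = ties_x = ties_y = 0
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
            dx, dy = xi - xj, yi - yj
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
        n0 = len(x) * (len(x) - 1) // 2
        denom = sqrt((n0 - ties_x) * (n0 - ties_y))
        return (concordant - discordant) / denom

    # A reader compared with herself yields the diagonal value of 1
    print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))  # → 1.0
    ```

    Each off-diagonal cell in Table 6 corresponds to running such a computation on one pair of readers' ratings under one reading condition (control, SR, or IR).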

    416 J Digit Imaging (2019) 32:408–416
