



Mammography Assessment using Multi-Scale Deep Classifiers

Ulzee An*
New York University, 251 Mercer St, New York, NY 10012
[email protected]

Khader Shameer†‡
Northwell Health, New York, NY 10011
[email protected]

Lakshmi Subramanian*§
New York University, 60 5th Ave, New York, NY 10011
[email protected]

* Courant Institute of Mathematical Sciences, New York University
† Department of Information Services, Northwell Health
‡ Center for Research Informatics and Innovation, Northwell Health
§ Center for Data Science, New York University

ABSTRACT
Applying deep learning methods to mammography assessment has remained a challenging topic. Dense noise with sparse expressions, mega-pixel raw data resolution, and a lack of diverse examples have all been factors affecting performance. The lack of pixel-level ground truths has especially limited segmentation methods in pushing beyond approximate bounding regions. We propose a classification approach grounded in high-performance tissue assessment as an alternative to all-in-one localization and assessment models, one that is also capable of pinpointing the causal pixels. First, the objective of the mammography assessment task is formalized in the context of local tissue classifiers. Then, the accuracy of a convolutional neural net is evaluated on classifying patches of tissue with suspicious findings at varying scales, where the highest obtained AUC is above 0.9. The local evaluations of one such expert tissue classifier are used to augment the results of a heatmap regression model and additionally recover the exact causal regions at high resolution as a saliency image suitable for clinical settings.

1 INTRODUCTION
The Breast Imaging-Reporting and Data System (BI-RADS) has remained the standard for categorizing the presence of benign and malignant tissue in mammography since its introduction in 1993. To apply it in practice, clinicians undergo years of training to identify minute details indicating various breast conditions, which point to cancer to varying degrees. In some cases both local and high-level expressions must be considered. For instance, sickle-shaped calcifications are indicative of plasma cell mastitis, which is only benign when oriented towards the nipple [15]. The result of an initial BI-RADS inspection is the basis for the bounding outlines commonly seen in clinical practice and provided in digital mammography datasets.

The Digital Database for Screening Mammography (DDSM) has been a popular starting point for applying deep learning to mammography, providing upwards of 2,600 cases with both CC and MLO angle X-rays. As with most clinically annotated tissue data, DDSM also makes available overlays of hand-drawn outlines circumscribing local groupings of suspicious tissue. When observed visually, these outlines indicate the general area affected by abnormal growth for visual emphasis, but they are far from pixel-accurate labels and only serve as a substitute for ground truth. Unless a model learns the semantics of these boundaries, training directly on the available annotations is unlikely to yield bounds tighter than what is available.



The proposed approach attempts to rectify the issue of loose annotations as well as to bring in the high accuracy recently demonstrated by breast tissue classifiers [11, 12], by making them available as an alternative source of inference and localization in mammography assessment. An auxiliary classifier is trained to demonstrate expert-level classification accuracy on local tissue samples at varying scales. Then, its pixel-wise gradient, which can be used to show saliency, is computed at fixed strides covering regions of interest proposed by a heatmap regression model. The saliency of the classifications is treated as a strictly tighter-bound indicator of suspicious tissue and aggregated at the original input size, preserving the highest-fidelity output resolution possible.

2 BACKGROUND AND RELATED WORK
The need for computer-aided diagnosis tools is felt strongly in mammography. The effectiveness of existing procedures is debated, with studies showing upwards of 30% of malignant growths to be retrospectively identifiable in past screenings [4, 8]. Assessments often extend to blind peer screenings, since the accuracy and reproducibility of mammography results vary with the expertise of clinicians. Human error in the interpretation of medical data is an ongoing topic of study [9], and leveraging machine intelligence is certainly an attractive option.

While the effectiveness of deep models is well demonstrated in medical imaging, their high parameter count poses an issue in domains where input resolution is megapixels in size. Dealing with large data sizes while maintaining model performance is a broadly studied issue. Classifiers exist which intelligently sample larger data, as proposed by Mnih et al. [14], but heuristic sampling runs the risk of missing co-occurring expressions.

The recovery of reasoning information from deep models has also been stressed as a necessary step of validation in the medical domain. Several approaches exist in this regard, including using global average pooling [18] to induce visual groupings of inferred causal regions. A more direct approach to saliency verification is to use gradient information with respect to pixels.

In practice, clinical assessment of mammography alternates between phases of broad localization and detailed assessment. Several bodies of work exist on replicating these steps separately as well as on proposing end-to-end deep models. Classification on pre-localized tissue samples has been widely successful [11], partly by deferring the problem of localization itself. The concept of detecting expressions at multiple scales has also been explored [13]. Localization itself has been an ongoing area of research [1, 10]. Among end-to-end models, the importance of preserving native resolution and the necessity of multi-view context are stressed in Geras et al. [6], where a classifier views both MLO and CC angles. The saliency visualization method in this study was chosen based on its demonstrated effectiveness in [6]. Ertosun et al. propose a two-neural-net setup where region proposal only occurs if a preliminary classifier determines the presence of suspicious tissue [5].




The joint segmentation and classification model R-CNN formulated in Girshick et al. [7] has also been applied to the medical domain with notable success, in brain scan imaging [2] and in mammography as well [16]. Mammography assessment incurs additional challenges for several reasons. X-rays are not cross sections: three-dimensional structures are compressed, giving rise to highly noisy expressions. The separation between tumorous and clear tissue can also be gradual, unlike that between distinct organs. Ribli et al. [16] demonstrate the highest performance in mammography segmentation.

Deep heatmap regression models [3] have demonstrated effectiveness in human pose-estimation settings but have not seen wide adoption in the medical imaging setting. While they are effective for images with guaranteed semantic consistency, they require additional filtering layers to learn complex dependencies.

3 METHODS
We briefly formalize a bottom-up perspective on the relevance of local tissue classifications to an overall assessment. A deep neural net classifier with a softmax layer infers the likelihood of a class given an example, p(y|x). In a setting where inferring on the full example x is not desirable, x can be substituted by fixed-size, fixed-stride samples x'_ij. Each x'_ij is used to infer an intermediate assessment z_ij. With the substitution, the original objective can be approximated as:

p(y|x) ≈ Σ_ij p(y | z_ij : C_ij) · q(z_ij | x'_ij)

where C_ij parameterizes the necessary positional context of tissue patch x'_ij with respect to the full example x.
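To make the aggregation concrete, the following is a minimal numpy sketch of the approximation above. The context-weighting array standing in for p(y | z_ij : C_ij) is hypothetical; the paper does not commit to a specific functional form for the positional context term.

```python
import numpy as np

def aggregate_patch_predictions(patch_probs, context_weights):
    """Approximate p(y|x) from patch-level posteriors.

    patch_probs:     (rows, cols, 2) array of q(z_ij | x'_ij) per patch offset (i, j).
    context_weights: (rows, cols) array standing in for p(y | z_ij : C_ij), i.e. how
                     much a finding at offset (i, j) contributes to the overall label.
    """
    positive = patch_probs[..., 1]                     # q(z_ij = finding | x'_ij)
    score = np.sum(context_weights * positive)         # sum over offsets i, j
    return score / max(np.sum(context_weights), 1e-8)  # normalize to [0, 1]
```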

Data Preparation. All scans were first flipped to one orientation, then cropped without downsampling to 2048-by-2048-pixel regions right-aligned on the nipple. From this base image, tissue samples were generated for testing at three different magnifications: ×0.5, ×0.33, and ×0.25. This procedure was repeated for all available case files in the DDSM database. Left and right scans from individuals were treated as independent samples to increase the variety of available data. Tissue samples for positive and negative cases were generated using separate methods. Findings-positive samples were generated directly by sampling at the center of mass of annotated regions. Very large annotations were ignored during training to ensure that the subject of the annotation was clearly captured. Negative samples were generated at fixed strides of 32 pixels from regions which did not intersect any annotations. All tissue samples were treated as independent examples. The content of the samples ranged from empty space to partially visible tissue to patches fully covered by tissue.
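A minimal sketch of the two sampling procedures, assuming square patches and a binary annotation mask; the patch size constant is illustrative, since the exact patch dimensions are not stated in this section.

```python
import numpy as np
from scipy import ndimage

PATCH = 256  # illustrative patch size; not specified in the text

def positive_patch(scan, annotation_mask):
    """One findings-positive patch centered on the annotation's center of mass."""
    cy, cx = ndimage.center_of_mass(annotation_mask)
    y = int(np.clip(cy - PATCH // 2, 0, scan.shape[0] - PATCH))
    x = int(np.clip(cx - PATCH // 2, 0, scan.shape[1] - PATCH))
    return scan[y:y + PATCH, x:x + PATCH]

def negative_patches(scan, annotation_mask, stride=32):
    """Negative patches at fixed 32-pixel strides, skipping annotated regions."""
    patches = []
    for y in range(0, scan.shape[0] - PATCH + 1, stride):
        for x in range(0, scan.shape[1] - PATCH + 1, stride):
            if not annotation_mask[y:y + PATCH, x:x + PATCH].any():
                patches.append(scan[y:y + PATCH, x:x + PATCH])
    return patches
```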

Figure 1: Local result aggregation method at ×0.5 magnification of raw resolution with a fixed stride of 64 pixels.

Tissue Classifier. Tissue classification has been approached from many different directions. Recent results have shown deep convolutional neural nets obtaining state-of-the-art performance for simple lesion identification [11, 12]. Our tissue-scale expert q(z_ij|x'_ij) was also defined as a convolutional neural net, with an architecture adopted from VGG, which has demonstrated the ability to classify hundreds of objects [17]. The objective of the classifier in our setting was defined as predicting the absence or presence of substantial findings, whether benign or malignant, in patches of tissue. This corresponds to BI-RADS assessments of either ≤ 1 or ≥ 2.
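As a sketch of what such a tissue-scale expert could look like in PyTorch: the block structure follows VGG [17], but the depth and channel widths here are assumptions, since the exact configuration is not listed.

```python
import torch.nn as nn

def vgg_block(cin, cout):
    # Two 3x3 convolutions followed by 2x2 max-pooling, as in VGG.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class TissueClassifier(nn.Module):
    """Binary patch classifier: BI-RADS <= 1 (no findings) vs. >= 2 (findings)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(1, 32), vgg_block(32, 64),
            vgg_block(64, 128), vgg_block(128, 128),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 2),  # two logits; softmax is applied by the loss
        )

    def forward(self, x):
        return self.head(self.features(x))
```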

Aggregated Local Results. To gather local presence information using the trained tissue classifier, it was repeatedly evaluated on the base magnified mammography scan at moving strides of 64 pixels, aggregating its local predictions q(z_ij|x'_ij) at offsets i, j. The collected predictions at i, j condense global information about suspicious findings to the best of the tissue classifier's ability.
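A sketch of this stride-64 evaluation, reusing the hypothetical TissueClassifier above; the patch size remains illustrative.

```python
import torch

@torch.no_grad()
def local_prediction_map(model, scan, patch=256, stride=64):
    """Slide the tissue classifier over the scan, collecting q(z_ij | x'_ij).

    scan: (1, H, W) tensor at the base magnification.
    Returns a (rows, cols, 2) grid of softmax outputs, one per offset (i, j).
    """
    model.eval()
    rows = (scan.shape[1] - patch) // stride + 1
    cols = (scan.shape[2] - patch) // stride + 1
    grid = torch.zeros(rows, cols, 2)
    for i in range(rows):
        for j in range(cols):
            tile = scan[:, i*stride:i*stride+patch, j*stride:j*stride+patch]
            grid[i, j] = torch.softmax(model(tile.unsqueeze(0)), dim=1)[0]
    return grid
```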

Baseline Heatmap Regression. A deep heatmap regression model [3] was chosen to roughly learn clinical annotations given mammography scans at a downsampled size of 256 pixels. The input image undergoes several layers of convolution with no fully connected layers. This implies that heat centers in the output image are direct transformations of expressions found at the same locations in the input. This inherent property can be advantageous for identifying small structures which must be distinguished from surrounding noise, but it is a trade-off against models with fully connected layers, which can learn complex dependencies. First, a baseline version of the heatmap model was tested. Then, a modification was made to introduce the locally aggregated results by concatenating them to a final output layer and adding additional convolution layers to train the regression between the aggregations and the top-level heatmap.
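A minimal sketch of such a fully convolutional regressor with the auxiliary concatenation; the filter counts are assumptions, with three stride-2 stages mapping a 256-pixel input to the 32-by-32 output size reported in Section 4.

```python
import torch
import torch.nn as nn

class HeatmapRegressor(nn.Module):
    """Fully convolutional heatmap regressor with optional local-results input."""
    def __init__(self, use_local_results=False):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 256 -> 128
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 128 -> 64
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 64 -> 32
        )
        head_in = 64 + (1 if use_local_results else 0)
        self.head = nn.Sequential(
            nn.Conv2d(head_in, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),  # single-channel heatmap
        )

    def forward(self, x, local_results=None):
        h = self.backbone(x)
        if local_results is not None:
            # local_results: (N, 1, 32, 32) grid of aggregated patch predictions.
            h = torch.cat([h, local_results], dim=1)
        return self.head(h)
```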

Pixel-wise Saliency. The entropy of a prediction p(y|x), an inverse measure of its confidence, is expressed as

H(y|x) = −Σ_{y'∈C} p(y'|x) log p(y'|x)

The saliency of a given prediction can be visualized by computing the gradient of the entropy with respect to each pixel, |∂H_ij/∂x'_ij| [6]. In the absence of y' during inference, the label inferred by the model itself is substituted. At the tissue scale this approach provides pinpoint annotations which are far tighter than circumscribing bounds.
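A minimal PyTorch sketch of this entropy-gradient saliency; differentiating the entropy of the softmax output implicitly uses the model's own prediction, matching the substitution described above.

```python
import torch
import torch.nn.functional as F

def entropy_saliency(model, patch):
    """Pixel-wise saliency as the absolute gradient of prediction entropy.

    patch: (1, 1, H, W) tensor. Returns |dH/dx| per pixel.
    """
    patch = patch.clone().requires_grad_(True)
    probs = F.softmax(model(patch), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    entropy.backward()
    return patch.grad.abs().squeeze()
```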



Figure 2: Heatmap regression with auxiliary input of locally aggregated results.

During full-scale saliency evaluation, only the saliency of q(z_ij|x'_ij) whose activation z_ij appears inside the heatmap inferred by the top-level heatmap regression model is preserved and aggregated.
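The gating and additive aggregation can be sketched as follows; gating on the patch center falling inside the (upsampled) heatmap mask is an assumption, since the precise overlap test is not specified.

```python
import numpy as np

def gated_saliency(saliency_tiles, offsets, heatmap_mask, out_shape, patch=256):
    """Additively aggregate per-patch saliency into a full-resolution image,
    keeping only patches whose center falls inside the predicted heatmap.

    saliency_tiles: list of (patch, patch) arrays from entropy_saliency.
    offsets:        matching (y, x) positions in the full scan.
    heatmap_mask:   boolean mask upsampled to the full scan resolution.
    """
    out = np.zeros(out_shape)
    for tile, (y, x) in zip(saliency_tiles, offsets):
        if heatmap_mask[y + patch // 2, x + patch // 2]:  # gate on the patch center
            out[y:y + patch, x:x + patch] += tile
    return out
```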

4 RESULTS

4.1 Classification

Tissue Classifier. Of the three classifiers trained at varying magnifications, ×0.5 yielded the best result. This was not surprising given the loss of detail with subsequent downsampling. However, as ×0.5 itself was not at maximum resolution, an acceptable level of downsampling may not impact predictive ability; notably, there are visual artifacts from the X-ray digitization process which disappear with downsampling. There was a greater abundance of tissue regions that did not show suspicious expressions, causing an imbalance between positive and negative examples. This was mitigated by applying random brightness, rotations, offsets, flipping, and cropping during training.
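A sketch of such an augmentation pipeline using torchvision; all parameter ranges are illustrative, since the exact settings are not reported.

```python
from torchvision import transforms

# Minimal sketch of the augmentations listed above (illustrative parameters).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                            # flipping
    transforms.RandomAffine(degrees=15, translate=(0.05, 0.05)),  # rotations, offsets
    transforms.RandomCrop(224),                                   # cropping within the patch
    transforms.ColorJitter(brightness=0.2),                       # random brightness
    transforms.ToTensor(),
])
```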

Performance of tissue classification approached the state of the art and was comparable to results from previous studies [11, 12].

Specific types of calcifications are inherently difficult to distinguish from surrounding tissue and would necessarily require both MLO and CC angles to identify. Clinically, this is also the underlying purpose of the two standardized angles for mammography. We noted in most failure cases that annotations would be present for lesions that are completely obscured and indistinguishable from surrounding tissue in one angle but visible from the other. It is standard in clinical settings to then annotate the location of the tissue from both angles. The inclusion of such examples in training lowers predictive power in classification under independent treatment of all patches. This semantic trait gives strong motivation for a multi-view setup as proposed in Geras et al. [6], which resolves it in one way. In the tissue classification case, annotated regions can be paired from both CC and MLO angles, then shown simultaneously to a classifier which learns one set of features using either region.

Baseline Heatmap Regression. As defined, the heatmap regression model undergoes max-pooling steps which result in a final heatmap output size of 32-by-32 pixels. The inferred heatmaps were scored using pixel-wise mean squared error (MSE) against downsampled clinical annotations, which were Gaussian-blurred to reduce the effect of annotation semantics and improve generalizability.
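A sketch of how the blurred regression targets and the MSE score might be constructed; the blur sigma is an assumption.

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def heatmap_target(annotation_mask, size=32, sigma=1.5):
    """Downsample a binary clinical annotation to the heatmap size, then blur it.

    sigma is illustrative; the blur radius used is not reported.
    """
    t = F.interpolate(annotation_mask[None, None].float(), size=(size, size),
                      mode='bilinear', align_corners=False)[0, 0]
    return torch.from_numpy(gaussian_filter(t.numpy(), sigma=sigma))

def heatmap_loss(pred, target):
    # Pixel-wise MSE between the inferred heatmap and the blurred annotation.
    return F.mse_loss(pred, target)
```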

Table 1: Data split between stages of evaluation (findings positive / negative).

Stage        Tissue Patches    Full Mammography (CC)
Training     3038 / 311452     1432 / 1984
Validation   32 / 32           10 / 10
Testing      256 / 256         100 / 100

Figure 3: ROC on held-out data for the tissue assessment task. Left to right: classifiers trained at ×0.5, ×0.33, and ×0.25 magnifications.

Table 2: Breakdown of binary predictions on held-out tissue samples (percentage of 256 reserved samples).

Model          TP     FP    TN     FN     Total Error
Tissue ×0.5    47.7   9.0   41.0   2.3    11.3
Tissue ×0.33   29.7   3.9   46.0   20.3   24.2
Tissue ×0.25   33.6   6.3   43.8   16.4   22.7

Table 3: Mean squared error between predicted and annotated regions with malignant findings at 256-pixel output resolution.

Model        Baseline   + local results (×0.5)
Training     0.12       2.25
Validation   6.48       6.04

With the introduction of locally aggregated results, a small improvement in loss was noted in validation alongside a larger average loss during training, which indicates a degree of reduced overfitting given the additional information.

The experiment gives motivation for a bottom-up approach to full BI-RADS localization and classification, as opposed to a top-down approach, which requires high computational resources and the need to sift through a significant amount of noise. Notably, there exists a discrepancy between tissue classification accuracy, which can be expected to produce above 0.9 AUC, and overall mammography assessment accuracy, which often falls short except in very specific successful cases such as Ribli et al. [16]. The discrepancy is mainly explained by the difficulty of localization, but with a sufficiently accurate tissue classifier q(z_ij|x'_ij) and an informative positional context encoding C_ij, aggregated classifications should yield the foundations for a highly accurate top-level assessment.



Figure 4: Examples of heatmap inference on held-out mammograms.

Figure 5: Examples of held-out tissue samples which were correctly classified, with their saliency displayed.

Figure 6: Examples of extracted pixel-level saliency shown in comparison to the original annotations.

4.2 Saliency Retrieval
Saliency from local tissue classifications was considered in the final visualization only if its activation occurred within corresponding regions inside the final predicted heatmap and if the activation itself indicated a positive prediction. Additionally, the predicted heatmap served as a tight bound, and any saliency noise placed outside its boundaries was ignored. The locally computed saliency images were additively aggregated to their global positions in the overall scan at their corresponding magnification level. The overall effect of the procedure was to highlight the offending tissue expressions with pinpoint accuracy.

5 CONCLUSION
Classification tasks on dense tissue images pose several challenges to vanilla deep learning methods. The problem of mammography assessment was broken down into a classification task specializing in discriminating tissue expressions locally, followed by a full-context heatmap regression model which guides the aggregation of local results. High accuracy was demonstrated for tissue-scale classification, and the results of the proposed saliency evaluation method were demonstrated to enhance baseline clinical annotations.

REFERENCES
[1] John Arevalo, Fabio A. González, Raúl Ramos-Pollán, Jose L. Oliveira, and Miguel Angel Guevara López. 2016. Representation learning for mammography mass lesion classification with convolutional neural networks. Computer Methods and Programs in Biomedicine 127 (2016), 248–257.

[2] Zeynettin Akkus, Alfiia Galimzianova, Assaf Hoogi, Daniel L. Rubin, and Bradley Erickson. 2017. Deep Learning for Brain MRI Segmentation: State of the Art and Future Directions. 30 (06 2017).

[3] Adrian Bulat and Georgios Tzimiropoulos. 2016. Human pose estimation via Convolutional Part Heatmap Regression. CoRR abs/1609.01743 (2016).

[4] Heidi D. Nelson, Ellen S. O'Meara, Karla Kerlikowske, Steven Balch, and Diana Miglioretti. 2016. Factors Associated With Rates of False-Positive and False-Negative Results From Digital Mammography Screening: An Analysis of Registry Data. 164 (01 2016).

[5] M. G. Ertosun and D. L. Rubin. 2015. Probabilistic visual search for masses within mammography images using deep learning. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 1310–1315. https://doi.org/10.1109/BIBM.2015.7359868

[6] Krzysztof J. Geras, Stacey Wolfson, S. Gene Kim, Linda Moy, and Kyunghyun Cho. 2017. High-Resolution Breast Cancer Screening with Multi-View Deep Convolutional Neural Networks. CoRR abs/1703.07047 (2017).

[7] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013).

[8] Solveig R. Hoff, Anne-Line Abrahamsen, Jon Helge Samset, Einar Vigeland, Olbjørn Klepp, and Solveig Hofvind. 2012. Breast Cancer: Missed Interval and Screening-detected Cancer at Full-Field Digital Mammography and Screen-Film Mammography: Results from a Retrospective Review. Radiology 264, 2 (2012), 378–386. https://doi.org/10.1148/radiol.12112074 PMID: 22700555.

[9] Khader Shameer, Kipp Johnson, Benjamin Glicksberg, Joel T. Dudley, and Partho Sengupta. 2018. Machine learning in cardiovascular medicine: are we there yet? Heart (01 2018).

[10] Thijs Kooi, Geert Litjens, Bram van Ginneken, Albert Gubern-Merida, Clara I. Sanchez, Ritse Mann, Ard den Heeten, and Nico Karssemeijer. 2017. Large scale deep learning for computer aided detection of mammographic lesions. Medical Image Analysis 35 (Jan 2017), 303–312. https://doi.org/10.1016/j.media.2016.07.007

[11] Daniel Levy and Arzav Jain. 2016. Breast Mass Classification from Mammograms using Deep Convolutional Neural Networks. CoRR abs/1612.00542 (2016).

[12] Daniel Levy and Arzav Jain. 2016. Breast Mass Classification from Mammograms using Deep Convolutional Neural Networks. CoRR abs/1612.00542 (2016).

[13] William Lotter, Greg Sorensen, and David Cox. 2017. A Multi-Scale CNN and Curriculum Learning Strategy for Mammogram Classification. CoRR abs/1707.06978 (2017).

[14] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2204–2212.

[15] Ajay Aroor Rao, Jennifer Feneis, Chloe Lalonde, and Haydee Ojeda-Fournier. 2016. A Pictorial Review of Changes in the BI-RADS Fifth Edition. RadioGraphics 36, 3 (2016), 623–639. https://doi.org/10.1148/rg.2016150178 PMID: 27082663.

[16] Dezso Ribli, Anna Horvath, Zsuzsa Unger, Peter Pollner, and Istvan Csabai. 2018. Detecting and classifying lesions in mammograms with Deep Learning. Scientific Reports 8, 1 (2018), 4165. https://doi.org/10.1038/s41598-018-22437-z

[17] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).

[18] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2015. Learning Deep Features for Discriminative Localization. CoRR abs/1512.04150 (2015).
