Understanding the Mechanism of Deep Learning Framework for

Lesion Detection in Pathological Images with Breast Cancer

Wei-Wen Hsu1,2, Chung-Hao Chen1, Chang Hoa2, Yu-Ling Hou2, Xiang Gao2,

Yun Shao3, Xueli Zhang3, Jingjing Wang3, Tao He2, Yanghong Tai3

1Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, U.S.A.

2The Institute of Big Data Technology, Bright Oceans Corporation, Beijing, China

3Department of Pathology, The Fifth Medical Center of the PLA General Hospital, Beijing, China

Abstract

Computer-aided detection (CADe) systems are developed to assist

pathologists in slide assessment, increasing diagnostic efficiency and reducing missed

lesions. Many studies have shown that CADe systems using deep learning

approaches outperform those using conventional methods that rely on hand-crafted

features based on domain knowledge. However, most developers who adopted deep

learning models focused directly on the efficacy of outcomes, without providing

comprehensive explanations of why their proposed frameworks work effectively.

In this study, we designed four experiments to verify the consecutive concepts, showing

that the deep features learned from pathological patches are interpretable by domain

knowledge of pathology and enlightening for clinical diagnosis in the task of lesion

detection. The experimental results show the activation features work as morphological

descriptors for specific cells or tissues, which agree with the clinical rules in

classification. That is, the deep learning framework not only detects the distribution of

tumor cells but also recognizes lymphocytes, collagen fibers, and some other non-cell

structural tissues. Most of the characteristics learned by the deep learning models

summarize detection rules that experienced pathologists can recognize,

whereas some features may not be intuitive to domain experts but are

discriminative in classification for machines. Those features are worth studying

further in order to find reasonable correlations with pathological knowledge, from

which pathological experts may draw inspiration for exploring new characteristics in

diagnosis.

Key-words: CADe system, Lesion Detection, Activation Features, Visual

Interpretability

Introduction

Biomedical image analysis is a complex task which relies on highly-trained

domain experts, like radiologists and pathologists. In pathology, the manual process of

slide assessment is laborious and time-consuming, and wrong interpretations may

happen owing to fatigue or stress in specialists. In addition, the number of registered

pathologists is insufficient, so the workload of each pathologist keeps growing,

which has become a problem in pathology. Recently, the techniques of image

processing and machine learning have advanced significantly, and computer-aided

detection/diagnosis (CADe/CADx) systems[1-4] have been developed to assist pathologists

in slide assessment. Working as a second-opinion system, a CADe/CADx system is designed to alleviate the

workload of pathologists and to avoid missed lesions.

Earlier studies in machine learning focused mainly on the development of classifiers.

However, data scientists found that feature extraction for data representation was the bottleneck

of performance in tasks of classification and detection. Therefore, feature engineering,

which concentrates on methods to extract features that make machine learning

algorithms work effectively, became increasingly critical for performance. In

representation learning, scientists aim to develop the techniques that allow a system to

automatically discover the representations needed for classification or detection from

raw data. Since 2012[5], the framework of Deep Convolutional Neural Networks

(DCNN) has achieved outstanding performances on many applications of computer

vision. Many studies have shown that the classification results with features extracted

from deep convolutional networks, known as activation features, outperform the results

with the conventional approaches using hand-crafted features[1, 4]. Accordingly, the

deep learning framework has been widely adopted for the tasks of histopathological

image analysis. Nonetheless, CADe/CADx systems based on deep learning

are hard for medical specialists to accept, since the deep learning

framework works in an end-to-end fashion that takes raw images as inputs and derives the

outcomes directly. A theoretical explanation of the mechanism of

such systems is lacking because most developers simply focused

on the efficacy of outcomes, without providing a comprehensive account of how their

proposed frameworks work[6]. Consequently, many medical specialists regard the deep

learning framework as a “black box” and doubt the feasibility of such systems in

clinical practice.

A DCNN comprises convolutional layers and fully connected

layers, which perform feature extraction and classification respectively during the process

of optimization. In the convolutional layers, local features such as colors, end-points,

corners, and oriented edges are collected in the shallow layers. These local features

are integrated into larger structural features such as circles, ellipses,

specific shapes, or patterns as the layers go deeper. These features of

structures or patterns then constitute the high-level semantic representations that describe

the feature abstraction for each category[7]. The fully connected layers, on the other hand,

take the extracted features from the convolutional layers as inputs and work as a

classifier, well known as the Multilayer Perceptron (MLP). These fully connected layers

encode the spatial correspondences of those semantic features and convey the

co-occurrence properties between patterns or objects[8].

Many studies have examined the visual interpretability of deep learning models

on datasets of natural images[7, 9-12] and shown that the mechanism of deep learning

frameworks follows the prior knowledge for each category in classification. The

process of the system is concordant with human intuition in tasks of image

classification[13]. However, in pathological image analysis, explanations of the

proposed systems using deep learning frameworks have been insufficient, so

the feasibility of such computer-aided systems is still questioned by medical

specialists.

The purpose of this study is to provide visual interpretability to explain the

mechanism of the deep learning framework in tasks of lesion detection for histology

images. We studied the properties of the activation features extracted by the deep

learning models for lesion detection under the view at high magnification (X40). Four

experiments were designed consecutively to show that the extracted activation features

are (i) transferable to other classifiers, (ii) meaningful in classification, (iii)

interpretable by the domain knowledge of pathology, and (iv) enlightening for exploring

new cues in pathological image analysis. To demonstrate that, the classifiers, such as

Support Vector Machine (SVM) and Random Forests (RF), were used in our

experiments to replace the fully connected layers to decompose the end-to-end

framework so that we can focus on the characteristics of feature extraction in the

convolutional layers. Determining which classifier performs best, or

whether substituting the fully connected layers yields better performance, is therefore not

the aim of this study.

Materials and Methodology

In this study, 27 H&E stained specimens of breast tissue with Ductal Carcinoma

in Situ (DCIS) were collected and digitized in the format of Whole-Slide Images (WSIs).

All lesions of DCIS were labeled in blue by a registered pathologist and confirmed by

another registered pathologist, as shown in Figure 1-(a). To perform lesion detection

through WSIs, many small patches were sampled under the view at high magnification

(X40), called patching[2, 14]. There are two kinds of sampling sets: positive set and

negative set. The positive set collected the patches with tumorous cells by sampling

from the annotated regions. On the other hand, the patches with normal cells or normal

tissues were sampled outside the annotated regions, comprising the negative set. There

were about 140k patches that were sampled from the total labeled regions of DCIS. To

balance the training data set, the same number of patches was also collected for the

negative set. As a result, the total training data comprise about 280k patches. The

training procedure for the deep learning framework in tasks of lesion detection is shown

in Figure 1-(b).

Figure 1. The annotations of lesions and training the DCNN model.

(a) Fully-labeled lesions of DCIS in a whole slide image.

(b) The training procedure of the deep learning framework for lesion detection.

In our experiments, an AlexNet[5] model pre-trained on the ImageNet

dataset was used to perform transfer learning[15]. Since we used support

vector machine and random forest classifiers to replace the fully connected layers and thereby

decompose the end-to-end framework, the feature size for each patch is

9216 by 1 using the pre-trained AlexNet model. Such a dataset would be too large for

classifiers like SVM and RF if all 280k sampled patches were used in training.

Therefore, to shrink the dataset and make training feasible, 20k patches

(positive: negative = 1:1) were randomly selected from the total dataset as the final

training dataset to fine-tune the deep learning model and learn the activation

features[16]. The extracted activation features were represented by the scores from the

results of forward propagation through the convolutional layers. For performance

evaluation, another 2k patches (positive: negative = 1:1) were further collected from

the total dataset as the testing dataset to compute out-sample accuracy by the trained

DCNN models and several classifiers, including Logistic Regression (LR), Support

Vector Machine (SVM), and Random Forests (RF).
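
The class-balanced subsampling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `balanced_split`, the patch identifiers, and the toy set sizes are hypothetical, and drawing the testing patches disjointly from the training patches is an assumption that the text does not state explicitly.

```python
import random

def balanced_split(pos_ids, neg_ids, n_train, n_test, seed=0):
    """Draw disjoint, class-balanced (1:1) training and testing subsets of patch ids."""
    if n_train % 2 or n_test % 2:
        raise ValueError("subset sizes must be even for a 1:1 class ratio")
    half_train, half_test = n_train // 2, n_test // 2
    rng = random.Random(seed)
    train, test = [], []
    for label, ids in ((1, pos_ids), (0, neg_ids)):
        if len(ids) < half_train + half_test:
            raise ValueError("not enough patches in one class")
        # One draw without replacement per class keeps train and test disjoint.
        picked = rng.sample(list(ids), half_train + half_test)
        train += [(pid, label) for pid in picked[:half_train]]
        test += [(pid, label) for pid in picked[half_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy-sized stand-in for the ~140k positive and ~140k negative patches.
pos = [f"pos_{i}" for i in range(1400)]
neg = [f"neg_{i}" for i in range(1400)]
train_set, test_set = balanced_split(pos, neg, n_train=200, n_test=20)
```

With the sizes reported in the text, the call would use `n_train=20000` and `n_test=2000`.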

To observe the pattern for each activation feature that was used in patch

classification, the size of the Field-of-View (FOV) was computed to derive the

mappings between the neurons and their corresponding FOVs in the input image, as

shown in Figure 2. In Figure 2, the number of channels in the assigned convolutional

layer, i.e., 256, means the number of patterns that were learned in the training procedure.

The neuron in each channel represents the spatial orientation with respect to its

corresponding FOV in the input image. A neuron with a high activation score

indicates that the learned pattern responds strongly to the corresponding region (FOV) in

the image, reflecting the matching level between them. For visualization[17], the

activation scores of neurons in the assigned convolutional layer were recorded from all

patches and ranked by the scores for each channel. Then the patches with top 100

activation scores for each channel were collected with the corresponding high-response

region highlighted in a yellow bounding box. We also visualized the activation heatmap

and resized it to the same size as the input image to have better observations on the

spatial distribution of the learned patterns. Figure 3 shows one of the examples in our

experiment of visualization.
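
The mapping from a neuron to its FOV can be sketched with the usual receptive-field recurrence. The paper does not list the layer hyperparameters or spell out its own FOV computation, so the kernel, stride, and padding values below (those of the commonly used AlexNet with a 227x227 input) are assumptions, and the function names are hypothetical. The result is the theoretical receptive field, which is an outer bound: the region that actually drives a neuron is typically concentrated near the centre of this box.

```python
# (name, kernel, stride, padding) for the standard AlexNet feature extractor.
# ASSUMPTION: these values and the 227x227 input are not given in the paper.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
]

def receptive_field(layers):
    """Return (size, jump, first_center) of one output neuron, in input pixels."""
    size, jump, center = 1, 1, 0.5
    for _name, kernel, stride, pad in layers:
        center += ((kernel - 1) / 2 - pad) * jump  # centre of the first neuron
        size += (kernel - 1) * jump                # receptive field grows by (k-1)*jump
        jump *= stride                             # spacing between adjacent neurons
    return size, jump, center

def fov_box(row, col, layers=ALEXNET_LAYERS, input_size=227):
    """Bounding box (top, left, bottom, right) of neuron (row, col), clipped to the image."""
    size, jump, first = receptive_field(layers)

    def span(index):
        mid = first + index * jump
        return max(0, int(mid - size / 2)), min(input_size, int(mid + size / 2))

    top, bottom = span(row)
    left, right = span(col)
    return top, left, bottom, right

rf_size, rf_jump, rf_first = receptive_field(ALEXNET_LAYERS)
center_box = fov_box(6, 6)  # neuron in the middle of the 13x13 map
```

Under these assumptions, neighbouring neurons in the 13x13 map are 16 input pixels apart, so the yellow bounding box of a high-response neuron can be placed directly from its row and column index.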

Figure 2. The mappings between neurons and their corresponding FOVs.

Figure 3. The patch (on the left) with the highest activation score in channel No. 49

and its corresponding activation heatmap (on the right).
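
The ranking step used for visualization (recording activation scores for all patches and keeping the top-scoring patches per channel) might look like the sketch below. The data layout, the function name `top_patches_per_channel`, and the toy 2x2 maps are assumptions for illustration; in the paper's setting each patch would carry 256 channels of 13x13 scores and `k` would be 100.

```python
import heapq

def top_patches_per_channel(activations, k=100):
    """activations: {patch_id: [channel][row][col] scores}.

    Returns {channel: [(score, patch_id, (row, col)), ...]} in descending score order,
    where (row, col) is the neuron with the highest response in that patch.
    """
    best = {}
    for patch_id, maps in activations.items():
        for channel, fmap in enumerate(maps):
            score, pos = max(
                (value, (r, c))
                for r, row in enumerate(fmap)
                for c, value in enumerate(row)
            )
            best.setdefault(channel, []).append((score, patch_id, pos))
    return {ch: heapq.nlargest(k, items) for ch, items in best.items()}

# Toy example: three patches, two channels, 2x2 activation maps.
toy = {
    "patch_a": [[[0.1, 0.9], [0.2, 0.3]], [[0.0, 0.1], [0.1, 0.0]]],
    "patch_b": [[[0.4, 0.2], [0.1, 0.0]], [[0.7, 0.1], [0.2, 0.3]]],
    "patch_c": [[[0.5, 0.1], [0.6, 0.2]], [[0.2, 0.8], [0.1, 0.0]]],
}
top = top_patches_per_channel(toy, k=2)
```

The stored `(row, col)` position is what would be mapped back to an FOV in the input image to draw the highlighted region.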

Experiments and Results

Exp #1 Feature Extraction in DCNN

Motivation: Even though the deep learning model is an end-to-end structure, it, in fact,

can be decomposed into two parts: convolutional layers for feature extraction and fully

connected layers for classification. The goal of this experiment is to verify that the

features extracted by the deep learning models are meaningful in classification, so that

those features can be combined with other classifiers rather than being

exclusive to neural networks.

Hypothesis: Features extracted from the convolutional layers are meaningful in

classification and can work with other classifiers as well.

Models: The end-to-end AlexNet model was used in training and testing for

comparisons, and the structure is shown in Figure 4. For the control group, the fully

connected layers in AlexNet were replaced by other classifiers, such as Logistic

Regression (LR), support vector machine (SVM), and Random Forest (RF), as shown

in Figure 5.
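
As a rough illustration of this setup, the sketch below trains a separate classifier on extracted feature vectors in place of the fully connected layers. It is a minimal pure-Python logistic regression, not the authors' implementation: the function names, learning rate, epoch count, and the 2-D toy "features" (standing in for the 9216-dimensional activation features) are all assumptions. In practice a library implementation of LR, SVM, or RF would be used.

```python
import math

def train_logistic_regression(features, labels, lr=1.0, epochs=2000):
    """Full-batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Label 1 (lesion) if the decision value is positive, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0

def accuracy(weights, bias, features, labels):
    hits = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return hits / len(labels)

# Toy "activation features": class 0 clusters near the origin, class 1 near (1, 1).
train_x = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0],
           [0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 1.0]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x, test_y = [[0.15, 0.15], [0.85, 0.85]], [0, 1]

w, b = train_logistic_regression(train_x, train_y)
in_sample = accuracy(w, b, train_x, train_y)
out_sample = accuracy(w, b, test_x, test_y)
```

The two accuracy values correspond to the in-sample and out-sample columns that such a comparison reports for each classifier.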

Figure 4. The structure of the end-to-end AlexNet model.

Figure 5. Classifier (LR/SVM/RF) was applied to replace the fully connected layers

as the control group.

Results and Discussion: The performances of different models in training and testing

are listed in the columns of in-sample accuracy and out-sample accuracy, respectively,

in Table 1. The testing results show only tiny differences in accuracy among models, which

means the features learned by the deep learning models are not restricted to

end-to-end neural networks. Those features are meaningful in classification and can

be combined with other classifiers as well. From Table 1, it is noteworthy that overfitting

seemed to occur in the model using Random Forest, whereas the model using

Logistic Regression has the highest out-sample accuracy of all. This suggests that a

simpler model may perform better on the out-sample dataset because it

generalizes better.

Table 1. Comparisons among four different classifiers.

Exp #2 Visualization of Model

Motivation: The deep learning model has demonstrated its capability in distinguishing

patches with or without lesions, and the previous experiment showed that the activation

features learned by the DCNN models are meaningful in classification. We aim to

find the patterns that contribute to the classifier's decisions in order to understand

the mechanism of the deep learning model from the pathological view.

Hypothesis: Most activation features agree with the clinical rules in pathology.

Model: The trained AlexNet model from Exp#1 was used to visualize the activation

heatmap for each input patch, as shown in Figure 6. Forward propagation was

performed through the convolutional layers for all patches to derive the corresponding

activation heatmaps, and all activation scores were recorded and ranked for all 256

channels.
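
Resizing a channel's activation map to the input size, so that it can be overlaid on the patch as a heatmap, can be sketched as below. The paper does not say which interpolation was used; nearest-neighbour upsampling is an assumption chosen here for simplicity, and `resize_nearest` is a hypothetical helper name.

```python
def resize_nearest(heatmap, out_h, out_w):
    """Nearest-neighbour upsampling of a 2-D list of scores to out_h x out_w."""
    in_h, in_w = len(heatmap), len(heatmap[0])
    return [
        [heatmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# Toy 2x2 map enlarged to 4x4: each score becomes a 2x2 block.
small = [[0.0, 0.25], [0.5, 1.0]]
big = resize_nearest(small, 4, 4)

# A 13x13 activation map enlarged to an assumed 227x227 input size.
full = resize_nearest([[0.0] * 13 for _ in range(13)], 227, 227)
```

Each block of the enlarged map then lines up with the image region that the corresponding neuron looks at, which is what makes the spatial distribution of a learned pattern visible.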

Figure 6. The activation heatmap was generated from the output of forward

propagation through the convolutional layers in the trained model for each channel.

Results and Discussion: The sampling patches and the corresponding heatmaps for the

selected channels were listed in Figure 7, classified by the category in pathology. From

observations, the patterns learned from DCNN models are the morphological

descriptors for specific cells or tissues, working as detectors. And the activation

heatmaps reflect the spatial distribution of the learned patterns from the input patches.

Interestingly, in this experiment, only the regions with lesions were manually labeled

by the pathologists; however, we found the deep learning models are able to discover

the main components in the images and categorize them by their characteristics. That

is, in the task of lesion detection, the deep learning models can not only detect the

distribution of tumor cells but also recognize lymphocytes, collagen fibers, and some

other non-cell structural tissues such as luminal space, areas of necrosis, and secretions.

The results show that the activation features learned by the DCNN models

accord with clinical insights in pathology, which supports our hypothesis.

Figure 7. The activation heatmaps reflect the high response regions for each

channel, and many activation features agree with clinical insights in pathology.

Exp #3 Feature Reduction

Motivation: In tasks of image classification on the datasets of natural images, the

spatial structure of patterns is an essential characteristic for the deep learning models to

recognize the objects. For example, eyes are detected above a nose or a mouth if there

is a human face in the image. However, in our experiments, since patches were sampled

under the view at high magnification (X40), cells and tissues are arbitrarily distributed

in the small patches, as shown in Figure 8. The characteristic of patterns’ spatial

distributions seems meaningless and irrelevant in the task of patch classification here.

Hypothesis: The spatial orientations of patterns can be ignored in patch

classification, and feature reduction can be applied to speed up the system.

Figure 8. Cells and tissues are arbitrarily distributed in the sampling patches.

Model: From the previous experiment, we know that the deep learning models can

recognize tumor cells, lymphocytes, and collagen fibers. Some of the learned activation

features can be regarded as detectors for these categories. Since we assumed the

information of spatial orientations for these elements could be ignored within the small

patches, the tasks of patch classification can be accomplished by checking if the lesion

exists without knowing its exact orientation. Accordingly, a 13 by 13 average pooling

layer was adopted to replace the original max pooling layer in Layer 5. The modified

model is shown in Figure 9. As a result, the total number of features for classification

was reduced from 6x6x256 (9216) to 1x1x256 (256), i.e., to

1/36 of the original size.
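
The feature counts above follow from the pooling arithmetic, which the sketch below reproduces. It assumes the original Layer 5 max pooling is the standard AlexNet 3x3 window with stride 2 over a 13x13 map; the helper names are hypothetical. The second function shows why the 13x13 average pooling discards spatial information: two maps with the same response at different positions pool to the same value.

```python
def pooled_size(size, kernel, stride):
    """Output width of an unpadded pooling window sliding over `size` positions."""
    return (size - kernel) // stride + 1

def global_average_pool(feature_maps):
    """Collapse each channel's 2-D map to its mean, discarding spatial position."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

CHANNELS, CONV5_SIZE = 256, 13
# ASSUMPTION: original Layer 5 pooling is 3x3 with stride 2 (standard AlexNet).
max_pool_side = pooled_size(CONV5_SIZE, kernel=3, stride=2)
# The 13x13 average pooling covers the whole map, leaving one value per channel.
avg_pool_side = pooled_size(CONV5_SIZE, kernel=13, stride=13)

features_before = max_pool_side ** 2 * CHANNELS
features_after = avg_pool_side ** 2 * CHANNELS

# Same response in opposite corners of two toy 2x2 maps gives the same pooled feature.
pooled = global_average_pool([[[0.0, 0.0], [0.0, 4.0]], [[4.0, 0.0], [0.0, 0.0]]])
```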

Figure 9. The modified model that applied 13x13 average pooling layer to discard

spatial information.

Results and Discussion: For comparison, the performances before and after feature

reduction are listed in Table 2. With a feature size 36 times smaller than the

original one, the out-sample accuracy remains at the same level or is even slightly

better. That means the characteristic of spatial orientations is redundant and can be

discarded within the sampled patches, which supports the hypothesis. The results also

show that constraining the complexity of the model can yield better

generalization, preventing the model from overfitting and giving a better

out-sample accuracy. Moreover, after applying feature reduction from 9216 to 256, the

system for lesion detection became 23% faster in execution. Performance

improved in both efficacy and efficiency using the model that was modified based on

prior knowledge.

Table 2. Comparisons of performances before and after feature reduction.

Exp #4 Feature Selection

Motivation: After feature reduction, the same method of visualization in Exp#2 was

used to observe the patterns learned from the modified model in Exp#3. The

visualization results were summarized in Figure 10. The activation heatmaps here were

in size of 13x13 before resizing, and the corresponding size of FOVs for each neuron

is about the same size as a cancerous nucleus so that the high response regions can

reflect the distribution of tumor cells very well. In addition, we found that deep learning models

can reveal the co-occurrence property of patterns by exploring the data. Figure 11

shows that the deep learning models not only focused on the characteristics of cancerous

nuclei but also noticed the effect of cytoplasmic clearing around those nuclei. In this

experiment, we dig into the extracted features to better understand

how the deep learning framework utilizes the 256 features from Exp#3.

Method: All 256 features from Exp#3 were partitioned into two groups. One group was

to collect the features that can convey clinical insights, which means the features can

work as detectors for specific cells or tissues, like the features collected in Figure 7 and

Figure 10, reported as “recognizable features” here. The remaining

features, which do not correspond to a specific category in pathology, belonged to

the other group and were reported as “unrecognizable features.” Figure 12 shows an

example of an unrecognizable feature. From our observations, 43 features from the

group of “recognizable features” were correlated to either tumor cells or lymphocytes

and were selected manually in this experiment to further reduce the feature size. In addition,

another 43 features were selected randomly from the group of “unrecognizable features”

as the control group for comparison.
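
The two 43-feature sets described above amount to selecting columns of the 256-dimensional feature vectors. The sketch below shows one way this could be done. The index lists are placeholders only: the paper does not report which channels belong to each group, so `cell_features` and `unrecognizable` are hypothetical, as are the helper names.

```python
import random

def select_columns(feature_rows, indices):
    """Keep only the listed feature columns of every sample."""
    return [[row[i] for i in indices] for row in feature_rows]

def random_control(candidate_indices, size, seed=0):
    """Draw a control set of the given size from the candidate feature indices."""
    return sorted(random.Random(seed).sample(sorted(candidate_indices), size))

N_FEATURES = 256
# PLACEHOLDER indices: the actual channel assignments are not given in the paper.
cell_features = list(range(43))          # manually selected cell-structure features
unrecognizable = list(range(150, 256))   # pool of unrecognizable features

control = random_control(unrecognizable, size=len(cell_features))

# One toy sample whose feature values equal their own column index.
sample = [[float(i) for i in range(N_FEATURES)]]
cell_view = select_columns(sample, cell_features)
control_view = select_columns(sample, control)
```

Training the same classifier once on `cell_view`-style data and once on `control_view`-style data keeps the feature size equal, so any difference in accuracy can be attributed to which features were kept.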

Figure 10. Visualization of the modified model in Exp#3. The selected activation

features worked as detectors for specific cells or tissues.

(a) An activation feature that focused on the characteristics of cancerous nuclei.

(b) An activation feature that targeted on regions of cytoplasmic clearing

around cancerous nuclei.

Figure 11. Deep learning models are able to reveal the co-occurrence property of

patterns by exploring from the training data.

Hypothesis: In manual lesion inspection, the pathologists usually focus on different

types of cells and then determine whether those cells are cancerous or not by the

morphological properties. Similarly, we argue that if we further reduce the feature size

by just selecting the cell-structure features, lesion detection should also be achieved.

And the model trained with cell-structure features is supposed to outperform the model

trained with “unrecognizable features” under the same feature size since they are more

useful and important from the pathological view.

Results and Discussion: Here we used only Random Forest as the classifier to keep

comparisons consistent across all scenarios starting from our first

experiment. The results are compared in Table 3. The training set of 43

features that were related to tumor cells or lymphocytes was denoted as (43) in Table 3.

And the set with randomly selected 43 features from the group of “unrecognizable

features” was denoted as (43). After feature selection, the results show that

performances decreased for both models, compared with the one using all 256 features.

And the model trained with the selected 43 cell-structure features outperformed the

model trained with the 43 unrecognizable features. Surprisingly, the model trained with

the 43 features randomly selected from the group of “unrecognizable features” can still

reach an out-sample accuracy above 94%. This implies that those features, although

unrecognizable by humans, were useful for machines and statistically discriminative in

classification. Accordingly, the top *43 important features ranked by the

classifier of Random Forests out of all 256 features were further collected and the set

was denoted as (*43). And the model trained with the top *43 important features

outperformed the model trained with the 43 cell-structure features. Analyzing the

members in the feature set of (*43), 33 features were from the group of “recognizable

features,” of which 14 features were related to tumor cells or lymphocytes. The remaining

10 features were from the group of “unrecognizable features.” Figure 12 is an example

of an unrecognizable feature that was discriminative in detecting the patches with

lesions. The heatmaps in Figure 12 show that the activation feature responds strongly to

the cytoplasmic parts of the tumor cells near interstitial spaces. These discriminative

but unrecognizable features are worth studying further in order to find

reasonable correlations with pathological knowledge and may facilitate the research of

new characteristics in diagnosis.
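
The importance-based selection and the breakdown of the selected set by group can be sketched as follows. The importance scores and group memberships below are made-up toy values for eight features, used only to show the bookkeeping; in the paper's setting the scores would come from the trained Random Forest over 256 features with `k = 43`. The function names are hypothetical.

```python
def top_k_by_importance(importances, k):
    """Indices of the k largest importance scores, highest first."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return order[:k]

def composition(selected, groups):
    """Count how many selected feature indices fall into each named group."""
    return {name: sum(i in members for i in selected) for name, members in groups.items()}

# TOY VALUES: eight features with invented importance scores and group labels.
toy_importances = [0.05, 0.30, 0.02, 0.20, 0.01, 0.25, 0.03, 0.14]
toy_groups = {
    "recognizable": {0, 1, 2, 3, 4},
    "unrecognizable": {5, 6, 7},
}

selected = top_k_by_importance(toy_importances, k=4)
counts = composition(selected, toy_groups)
```

The same counting applied to the (*43) set is what yields a split such as 33 recognizable and 10 unrecognizable features.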

Table 3. Comparisons of performances before and after feature selection.

Figure 12. An example of the unrecognizable feature.

Conclusions

In this study, four experiments were conducted to investigate the properties of

the activation features learned by the DCNN models. In the first experiment, we verified

that the activation features are transferable and meaningful in classification. By

visualization in the second experiment, we found that many activation features can work

like morphological descriptors to detect specific cells and tissues, and the results

accord with the categories in pathology. In the third experiment, we modified the model

based on prior knowledge to achieve better performance in both efficacy and efficiency.

In the fourth experiment, we further ranked all features by importance to compare the views of humans

and machines. We found that more than half of the extracted

features were interpretable by pathological knowledge, whereas the remaining unrecognizable

features seemed discriminative in classification. The deep learning models are good at

summarizing rules in classification, and the rules learned from big data should be

studied further to facilitate research in both the medical field and applications of

artificial intelligence.

References

[1] A. Janowczyk and A. Madabhushi, "Deep learning for digital pathology image

analysis: A comprehensive tutorial with selected use cases," Journal of

Pathology Informatics, vol. 7, no. 1, p. 29, 2016.

[2] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, "Deep Learning for

Identifying Metastatic Breast Cancer," 2016.

[3] B. E. Bejnordi et al., "Automated Detection of DCIS in Whole-Slide H&E Stained

Breast Histopathology Images," IEEE Transactions on Medical Imaging, vol. 35,

no. 9, pp. 2141-2150, 2016.

[4] A. Cruz-Roa et al., "Automatic detection of invasive ductal carcinoma in whole

slide images with convolutional neural networks," in Medical Imaging 2014:

Digital Pathology, 2014, vol. 9041, p. 904103: International Society for Optics

and Photonics.

[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep

convolutional neural networks," presented at the Proceedings of the 25th

International Conference on Neural Information Processing Systems - Volume

1, Lake Tahoe, Nevada, 2012.

[6] B. Korbar et al., "Looking Under the Hood: Deep Neural Network Visualization

to Interpret Whole-Slide Image Analysis Outcomes for Colorectal Polyps," in

2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops

(CVPRW), 2017, pp. 821-827.

[7] M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional

Networks," in Computer Vision – ECCV 2014, Cham, 2014, pp. 818-833:

Springer International Publishing.

[8] Q.-s. Zhang and S.-c. Zhu, "Visual interpretability for deep learning: a survey,"

Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp.

27-39, 2018.

[9] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep Inside Convolutional

Networks: Visualising Image Classification Models and Saliency Maps,"

Computer Science, 2013.

[10] A. Mahendran and A. Vedaldi, "Understanding deep image representations by

inverting them," pp. 5188-5196, 2014.

[11] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning Deep

Features for Discriminative Localization," pp. 2921-2929, 2015.

[12] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra,

"Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based

Localization," 2016.

[13] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S. C. Zhu, "Interpreting CNN Knowledge

via an Explanatory Graph," 2018.

[14] Y. Liu et al., "Detecting Cancer Metastases on Gigapixel Pathology Images,"

2017.

[15] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in

deep neural networks?," in International Conference on Neural Information

Processing Systems, 2014, pp. 3320-3328.

[16] F. A. Spanhol, L. S. Oliveira, P. R. Cavalin, C. Petitjean, and L. Heutte, "Deep

features for breast cancer histopathological image classification," in IEEE

International Conference on Systems, Man and Cybernetics, 2017, pp. 1868-1873.

[17] Y. Xu et al., "Large scale tissue histopathology image classification,

segmentation, and visualization via deep convolutional activation features,"

BMC Bioinformatics, vol. 18, no. 1, p. 281, 2017.