Assessment of lung damages from CT images using machine

IN DEGREE PROJECT MEDICAL ENGINEERING,SECOND CYCLE, 30 CREDITS

, STOCKHOLM SWEDEN 2018

Assessment of lung damages from CT images using machine learning methods.

Bedömning av lungskador från CT-bilder med maskininlärning metoder.

QUENTIN CHOMETON

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

iii

Abstract

Lung cancer is the most commonly diagnosed cancer in the world and itsfinding is mainly incidental. New technologies and more specifically artifi-cial intelligence has lately acquired big interest in the medical field as it canautomate or bring new information to the medical staff.

Many research have been done on the detection or classification of lungcancer. These works are done on local region of interest but only a few ofthem have been done looking at a full CT-scan. The aim of this thesis was toassess lung damages from CT images using new machine learning methods.

First, single predictors had been learned by a 3D resnet architecture: can-cer, emphysema, and opacities. Emphysema was learned by the networkreaching an AUC of 0.79 whereas cancer and opacity predictions were notreally better than chance AUC = 0.61 and AUC = 0.61.

Secondly, a multi-task network was used to predict the factors altogether.A training with no prior knowledge and a transfer learning approach usingself supervision were compared. The transfer learning approach showed sim-ilar results in the multi-task approach for emphysema with AUC=0.78 vs 0.60without pre-training and opacities with an AUC=0.61. Moreover using thepre-training approach enabled the network to reach the same performance aseach of single factor predictor but with only one multi-task network whichsave a lot of computational time.

Finally a risk score can be derived from the training to use these informa-tion in a clinical context.

KEYWORDS: Deep Learning, Artifical Neural Networks, Lung damages,CT-Scans, Multi-task learning, Transfer learning

iv

Acknowledgment

First and foremost, I would like to thank Carlos Arteta, my supervisor andguide during my project at Optellum Ltd. He helped me a lot with my workby bringing new ideas and always took time to answer all my questions.

I would also thank all the Optellum team for their welcoming and their help. Ilearned a lot by being part of this team and having the opportunity to discusswith them.

I would like to thank Dr. Chunliang Wang, my supervisor at KTH universityand Dr. Dmitry Grishenkov for guiding us through this master thesis project.

I would also thank Antoine Broyelle, my co-intern-mate at Optellum. Allour discussions and his ideas helped my project moved further. And all ourlunch times made bearable the British weather.

I would also thank Lottie Woodward who had the incredible patience toproofread a french style written report.

The author thanks the National Cancer Institute for access to NCI’s datacollected by the National Lung Screening Trial (NLST). The statements con-tained herein are solely those of the authors and do not represent or implyconcurrence or endorsement by NCI.

v

Nomenclature

AUC Area Under the Curve

CNN Convolutional Neural Network

CT Computed Tomography

DL Deep Learning

HU Hounsfield Unit

LR Learning Rate

ML Machine Learning

NLST National Lung Screening Trial

ROC Receiver Operating Characteristic

vi CONTENTS

Contents

1 Introduction 1

2 Materials and methods 22.1 Clinical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Image Annotations . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.2.1 Emphysema . . . . . . . . . . . . . . . . . . . . . . . . . 22.2.2 Opacities . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2.3 Consolidation . . . . . . . . . . . . . . . . . . . . . . . . 22.2.4 Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . 42.3.2 Data Augmentation . . . . . . . . . . . . . . . . . . . . 6

2.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . 62.4.1 Loss and Accuracy . . . . . . . . . . . . . . . . . . . . . 62.4.2 ROC and confusion matrix . . . . . . . . . . . . . . . . 7

2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.5.1 Single factor prediction . . . . . . . . . . . . . . . . . . 7

2.5.1.1 Network . . . . . . . . . . . . . . . . . . . . . 72.5.1.2 Training . . . . . . . . . . . . . . . . . . . . . . 8

2.5.2 Combination of factors . . . . . . . . . . . . . . . . . . . 102.5.2.1 Network . . . . . . . . . . . . . . . . . . . . . 102.5.2.2 Full training . . . . . . . . . . . . . . . . . . . 102.5.2.3 Self-supervision . . . . . . . . . . . . . . . . . 112.5.2.4 Training with Pre-training . . . . . . . . . . . 12

3 Results 133.1 Single factor prediction . . . . . . . . . . . . . . . . . . . . . . . 13

3.1.1 Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.1.2 Emphysema . . . . . . . . . . . . . . . . . . . . . . . . . 133.1.3 Opacities . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2 Combining factors . . . . . . . . . . . . . . . . . . . . . . . . . 143.2.1 Training with Pre-training . . . . . . . . . . . . . . . . . 173.2.2 Risk score . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Discussion 214.1 Main findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 General impact . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.3 Comparison to other methods . . . . . . . . . . . . . . . . . . . 224.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

CONTENTS vii

5 Conclusion 24

References 26

A State of the Art 29A.1 Clinical Background . . . . . . . . . . . . . . . . . . . . . . . . 29

A.1.1 Lung Cancer . . . . . . . . . . . . . . . . . . . . . . . . . 29A.1.2 Nodules . . . . . . . . . . . . . . . . . . . . . . . . . . . 30A.1.3 Risks factors for nodule malignancy . . . . . . . . . . . 31

A.1.3.1 Nodule size [21] . . . . . . . . . . . . . . . . . 32A.1.3.2 Nodule Morphology [15] . . . . . . . . . . . . 32A.1.3.3 Nodule Location [21] . . . . . . . . . . . . . . 32A.1.3.4 Multiplicity [21] . . . . . . . . . . . . . . . . . 32A.1.3.5 Growth Rate [21] . . . . . . . . . . . . . . . . . 32A.1.3.6 Age, Sex, Race [21] . . . . . . . . . . . . . . . . 32A.1.3.7 Tobacco [21] . . . . . . . . . . . . . . . . . . . 33

A.1.4 Problem of detection . . . . . . . . . . . . . . . . . . . . 33A.2 Engineering Background . . . . . . . . . . . . . . . . . . . . . . 33

A.2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . 33A.2.2 How does a deep learning network learn? . . . . . . . 34A.2.3 Transfer learning . . . . . . . . . . . . . . . . . . . . . . 36

A.2.3.1 Feature Extractor . . . . . . . . . . . . . . . . . 36A.2.3.2 Fine-tuning . . . . . . . . . . . . . . . . . . . . 36

A.2.4 Challenges with deep learning . . . . . . . . . . . . . . 37A.2.4.1 Architecture selection in transfer learning . . 37A.2.4.2 Number and quality of data: pre-processing . 40A.2.4.3 Time-processing (CPU and GPU) . . . . . . . 41A.2.4.4 Overfitting . . . . . . . . . . . . . . . . . . . . 41A.2.4.5 Performance evaluation . . . . . . . . . . . . . 42

1 INTRODUCTION 1

1 Introduction

The incidence of new cancer cases annually is 454.8 per 100,000 human beingin 2016 [14]. Additionally, lung cancer represents around 20% of deaths dueto cancer. Cancer, in general, is not a well-known disease and most of thecases are discovered too late to be treated. The main challenge nowadays isto detect and predict cancer as soon as possible, in order to treat it in the bestpossible way. For the past 3-4 years, artificial intelligence has been develop-ing in the medical field in order to provide useful tools helping for a betterdetection and prediction.

Lung cancer screening or incidental findings are the two main ways todetect lung cancer in a patient. Screening should normally be the main wayof finding lung cancer, as it is for breast cancer. But in most of the countriessuch as the UK, there is no national screening programme and most of thecases are incidental findings.

Incidental findings mean that lung nodules are found while doing an-other exam than lung targeted. For example, nodules can be found whiledoing a heart or a liver CT-scan. The main problem now is that radiologistsare not trained to sort the nodules and detect whether they are cancer or be-nign ones. These patients should be reported to a pulmonologist which willin most of the cases ask for a chest CT-scan which lead to more radiation ex-posure for the patient. In the worst case, the patient’s report never reachesthe pulmonologist and is lost resulting in their cancer never being found orbeing found too late.

Written in association with Optellum Ltd, is trying to address this is-sue. Their vision is to redefine cancer care from early diagnosis to treatment,by enabling every clinician to choose the optimal diagnostic and treatmentpathway for their patients. This is done by using machine learning on vastmedical image repositories. This thesis is part of this vision and focuses on:"the assessment of lung damages from CT images using machine learningmethods." It will focus on how to assess any kind of lung damages by us-ing deep learning methods whilst looking only at a global scale: the entireCT-scan. This work does not focus on the finding of nodules (local scale)as it has be done in [36, 18] but on the global assessment of lung damages(global scale). Then it will be possible to combine these two scales of featuresto better predict cancer.

2 2 MATERIALS AND METHODS

2 Materials and methods

2.1 Clinical Data

Machine Learning (ML) and even more Deep Learning (DL) require the useof large quantities of data. Data here are medical images and more particu-larly CT-Scans from the NLST dataset. NLST dataset is a screening trial inwhich around 50000 heavy smokers aged between 55 and 74 received eithera chest X-ray or a chest CT scan at three different time points if possible. Thisthesis has extracted CT Scans for 10 000 patients from this study to focus on.Counting the different time-points, a total number of 16164 images have beencollected.

The CT-scans come from a large number of sites as well as manufacturers.All the CT-scans have a 512x512 pixels dimensions in the axial plan but havea high variability in the resolution. (cf part 2.3)

The entire set of clinical data is split 70:30 into training and validation set.If a scan from a patient is chosen in one set, all the other CTs for this patientwould be included within the same set.

2.2 Image Annotations

Approximately 500 different metadata have been recorded for each scan inthe NLST trial. This data had been divided into categories like: demographic,lung cancer, smoking, death, or follow-up. Only a few of this metadata isinteresting in this master thesis. After a long analysis of the dictionaries sum-marizing them, three of them had been chosen plus one created by membersof the company:

2.2.1 Emphysema

Emphysema is defined as the "abnormal permanent enlargement of the airspacesdistal to the terminal bronchioles accompanied by destruction of the alveolarwall and without obvious fibrosis". A patient presenting emphysema wouldbe classified as 1 (fig 1a) [23].

2.2.2 Opacities

Opacities represent the result of a decrease in the ratio of gas to soft tissue(blood, lung parenchyma and stroma) in the lung (fig 1b).

2.2.3 Consolidation

Consolidation of the lung is a solidification of lung tissue due to liquid orsolid accumulation in the air spaces (fig 1c).

2 MATERIALS AND METHODS 3

2.2.4 Cancer

The cancer markup had been realized by trained doctors hired by the com-pany. Thanks to their knowledge, they had been able to differentiate cancernodules from benign nodules. A patient is marked as cancer as soon as hehas at least one cancer nodule in one of his available scans. For example, ifpatient X has no cancer on CT at time-point 0, but one cancer nodule at CT attime-point 1, both images of the patient will have the label cancer.

(a) Emphysema (b) Opacities (c) Consolidation

Figure 1: Visible diseases in the lung

All the metadata but cancer annotation had been created during the NLSTtrial and are subject to human errors. The proportion of positive patient de-pending on the factor is summarized in table 1 and fig 2. The proportion ofconsolidation is too low in the dataset to be used for training later. The dataare unbalanced. This problem was tackled during training phase by usingthe weighted sampler method in Pytorch. By giving a weight to each classit ensures that in each batch, there is 50:50 of images with and without thedisease. It results in showing more times each disease cases in average.

Class Positive Proportion

Cancer 940 5.9%Emphysema 5043 31.2%

Opacities 3783 23.4%Consolidation 124 0.8%

Table 1: Proportion of each factor in the NLST database


Figure 2: Distribution of NLST data

2.3 Preprocessing

2.3.1 Normalization

As previously noted, the CT-Scans vary a lot in term of resolution and inten-sity due to the variability of devices. In order to better generalize, a normal-ization step for all the images is necessary before using them for training.

ResamplingThe first normalization is to rescale the images to the same spatial resolution;1x1x2mm (x;y;z) for removing zoom or thickness variance. Indeed a scanmay have a spatial resolution of 2x2x2.5mm meaning that the distance be-tween slices is 2.5mm. The resampling has been completed using the nearestinterpolation.

StandardizationCT-scan intensity is measured in Hounsfield Unit (HU) and represents theradiodensity. The CT scans from our database range from -1024 to 1000. Theinteresting values for the lung images are around 0 as it represents water andaround -1000 as it represents air. As shown in figure 3a, these values are themost common in lung CT-Scan. Data in machine learning and deep learningare commonly standardized which means removing the mean value of thedataset and dividing by the standard deviation of the dataset. Values thenmostly range between -1 and 1. The mean and standard deviation values forthe NLST dataset can be seen in table 2. The example of distribution of HUbefore and after standardization can be found in figure 3a and 3b:

xstandard =x−meandataset

σdataset(1)


Mean dataset -440 HUStandard deviation dataset (σdataset) 480 HU

Table 2: Dataset mean and standard deviation

(a) Before (b) After

Figure 3: HU distribution before and after standardization

Fixed-sizeThe network has to be fed with the same inputs however after resampling,all the CT scans have a different number of pixel per slices and a differ-ent number of slices per scan. For example: 280x280 or 400x400 (x;y). Theadopted strategy is to use an input of 320x320x32 (x;y;z) which means 32slices of 320x320 pixels. In order to do this, images have either been croppedto 320x320 or a 0 padding had been added on the side to reach the 320x320input size. Using a random cropping could leave a tumor out of the lung, butin our study, this is not a problem. The main idea here is that the networkshould learn global pattern inside the lung and not local information as a tu-mor. So not having the tumor inside the image doesn’t change the patternof the lung in its globality, and then should not change what is the networklearning from the CT scan.

In order to reach the target of 32 slices several methods are computed andused in both the singular and multiple experiments:

• Choice of 32 slices in the CT with a pre-defined step in between 2 slices(typically 3)

• Random choice of 32 slices

• Minimum Intensity Projection (MinIP) over 3 slices: take the minimumvalue over the z-axis for 3 consecutive slices. This method emphasizesthe dark area and helps to detect emphysema

• Maximum Intensity Projection (MaxIP) over 3 slices: take the maximum


value over the z-axis for 3 consecutive slices. This method emphasizesthe light area and helps to detect nodules

• Average Intensity Projection (AIP) over 3 slices: takes the average valueover the z-axis for 3 consecutive slices.

2.3.2 Data Augmentation

In order to better generalize the results and avoid overfitting, data augmen-tation is used to increase the dataset. The first way is to apply a random cropto the images while fixing the size. By doing this, slightly different parts ofthe same image are shown which help the network learns further.

The other augmentation is a random flip over the x or y axis during train-ing meaning a reflection of the slices across the mid of the axis.

The last augmentation is a random rescale of the histogram intensity ofeach image. As shown in fig 3a, the normal distribution of an image in HU iswithin the range [-1000:600]. To rescale the image’s intensity, random mini-mum between -1150 and -850 is chosen and a random maximum between 100and 1300 and rescale the histogram from the initial range to the [min:max]range. The intensity of the image will change each time the network seesthe image. It is important as the intensity mainly depends on the machineand wrong correlation could be learned by the network. Indeed, in the NLSTdataset, patients had their CT in different hospitals, and the rate of Cancer orEmphysema can be really different from one hospital to another. Image inten-sity is a intrinsic parameter of the used machine depending on its calibration,the followed protocol and reconstruction filters used by the manufactureras shown in [20]. Then, without intensity augmentation, the network couldlearn simple correlation between emphysema cases and the average intensityof the scan.

2.4 Evaluation metrics

During this thesis, many models will be trained using different networks,training methods and hyperparameters (parameters which are set by handbefore the training such as learning rate, momentum, etc.). In order to com-pare all these models, evaluation metrics must be set before starting the train-ing. The metrics used to evaluate a model and compare two models are:

2.4.1 Loss and Accuracy

For both loss and accuracy, they will be computed for both training and val-idation set (cf Annex A). Loss function (also called error or cost functions)maps the network parameters to a scalar value which specify how wrong isthis set of parameters. The main task is then to minimize the loss function


by updating the network’s parameters. A loss function is computed dur-ing the forward pass of the network. Working on a classification problem,the cross-entropy loss function is used in all experiments (unless otherwisenoted). This loss function is a combination of the Log Soft Max function andthe Negative Log Likelihood function:

NLL(x, class) = −x[class] (2)

LSM(x, class) =log(exp(x[class])∑

j expx[j](3)

For the same experiment with different hyperparameters, the model reachingthe lower loss function on training and validation set can be defined as thebest model.

The accuracy is defined as the percentage of well classified elements in aclassification task. The higher the accuracy on the validation set, the betterthe model.

2.4.2 ROC and confusion matrix

The two other metrics are ROC (and AUC) and confusion matrix. These met-rics are defined in Annex A. In the ROC, the more the curve is to the topleft corner the better the model is. This can also be seen by comparing theArea Under the Curve (AUC). The higher the AUC, the better the model. Theconfusion matrix allows us to compute different metrics: accuracy, precision,recall, etc. These are important to understand what the network mis-classifyand compare with other models.

The decision to keep one model rather than another are made based onthese evaluation metrics.

2.5 Experiments

2.5.1 Single factor prediction

The first set of experiments is to determine the ability of the network to learndifferent disease factors: cancer, emphysema and opacities. These are trainedseparately but use the same network.

2.5.1.1 NetworkThe base architecture used to train the different factors is a 3D version ofResNet18 which will be called ResNet3D 5 [12]. The input is a set of 32 slicesor projection of slices having 320x320 pixels. Firstly, a 3D convolution witha 3x5x5 kernel and 32 channels. The large filter is used to detect larger com-ponent in the image (shapes, blobs, etc.). After a 3D batch normalization and


max pooling, the network is composed of four similar blocks (cf fig 4). Foreach block, the input is used twice, first in the succession of layers (left partof fig 4) but also added to the output of this succession (right part of fig 4).Then the output of one block feeds the input of two successive layers.

The classifier part of the network, which aims to determine if the factoris found or not, is composed of a 3D convolution with a 1x1x1 kernel and a3D spatial average pooling. Through this convolution, the 256 channels aremapped to a 2 classes output: whether they have the disease or not.

Figure 4: ResNet 3D block: The input of a 3D block is first passing throughthe succession of convolution 3D, BN, ReLu activation, convolution 3D andBN (left part of the diagram) but also is added to the output of the last BN(right part).

2.5.1.2 TrainingImplementation is done using Pytorch framework [27] and the network istrained using NVIDIA GPU. The training is done during 40 epochs, meaningiterating 40 times over the dataset. A batch size of 16 (maximum fitting thememory), a momentum of 0.99, a weight decay of 1e−4 and an initial learning


Figure 5: ResNet 3D used for cancer prediction for example

rate of 0.01. The learning rate decay is done with the plateau method. Ifthe validation loss does not decrease during 3 consecutive epochs, the actuallearning rate is divided by a factor 10. The learning rate (LR) is defined as theamount of change applied to the model. LR decay is used in order to be moreand more accurate and reach a global minima.

Cancer:The cancer training is done by using the cancer metadata. The input is avolume of 32 slices of 320x320 pixels. The average intensity projection is usedto get the 32 slices. As only 5.9% of the images are marked as cancer, theclasses are balanced during the training using a sampler.

All the described pre-processing and data augmentation steps are per-formed.

The goal is to see the ability of the network to detect cancer from a CTvolume.

EmphysemaThe emphysema training is similar to the cancer one. The emphysema meta-data is used as output. The minimum intensity projection is used as it revealsdarker part of the image which correspond to emphysema. All the describedpreprocessing and data augmentation steps are performed. The goal is to seethe ability of the network to detect emphysema from a CT volume.

OpacitiesThe opacity training is the same as emphysema and cancer but with the opac-ity metadata as output. The maximum intensity projection is this time used,as opacity is seen as a brighter area on the CT, the MaxIP emphasizes thepresence of opacity. All the described preprocessing and data augmentationsteps are performed. The goal is to see the ability of the network to detectopacity from a CT volume.


2.5.2 Combination of factors

Figure 6: Multi-task Network

Once training is computed and the ability of the network is understood,the goal is then to combine all these factors detection in the same network.The next experiments are computed with emphysema, cancer, and opacitiesannotations and try to predict the three of them with only one multi-tasknetwork. First, the network is trained from no prior knowledge and then aself-supervised learning is used before fine-tuning the network.

2.5.2.1 NetworkFor this follow-up prediction, a multi-task network shown in fig 6 is used.The network is separated in two parts. First the feature extraction part isthe ResNet 3D that was already used for the single factor prediction. Thenumber of channels is the same at each layer, only the last classifier part isremoved. The second part of the network is the factor predictions. In thispart, the output of the ResNet 3D is divided in three subparts for the threepredictors. During the training, a loss is computed for each of the classifierand backward through their own classifier and the ResNet 3D.

In total three losses are computed, one for each of the factor classifier. Thebackward phase (update of the weights by gradient descent) is then com-puted according to the three losses.

2.5.2.2 Full trainingThe full training is done while initializing the convolution weight with gaus-sian distribution centered in zero and with a standard deviation of

√2/n


where n is the number of inputs to the neuron. The batch normalizationlayers are initialized with a weight of 1 while of the bias are set to 0. Thetraining is done during 50 epochs with a batch size of 16, a momentum of0.99, a weight decay of 1e−4 and a initial learning rate of 0.01. The learningrate is lowered using the plateau function.

2.5.2.3 Self-supervision

Figure 7: Siamese Network

Another approach for training this complex multitask network is to firstpre-train the feature extractor on a specific task before using transfer learningand fine-tuning it on the multi-task. Different methods exist to pre-trained anetwork, self-supervision one is used in order to force the network to learnglobal features and not only specific to the main task. This method useSiamese network and have been first used for face recognition [6].

Siamese networkHere, a Siamese network is trained to distinguish between pairs of imagesfrom the same patient at two different time points or from two different pa-tients. The Siamese network is then two ResNet 3D networks in parallelsharing the same weights (see fig 7). The input is a pair of images whichgo through the ResNet 3D to compute a 256 vector. Vectors from the two in-puts are then used to compute the contrastive loss (cf next paragraph). Thedistance between two images is computed as the absolute difference betweenthe two output vectors.

Training of Siamese network


To compute the pairs of images, only patients with 2 CT-scan time points ormore are kept in the NLST database. It corresponds to 5,085 patients or pairsof images. Once a patient is chosen as an input, a random choice is made todetermine if the other image should come from the same patient or a differentone. If different, a random choice of a CT Scan among all the remaining CTsis made. All the data augmentation is performed. The random scale of theimage’s intensity is really important for the network to not learn specificitiesof the machine whilst differentiating similar from different patients.

In order to improve the results of the Siamese network, the adaptive mar-gin loss function described by Wang et al. [32] is chosen. Normally, Siamesenetwork are trained with the contrastive loss function [6] described in equa-tion (4).

(1− label)12D2w + label × 1

2{max(m−Dw, 0)}2 (4)

Where Dw represents the euclidean distance between the two output vectorsandm a defined margin. The main issue with this method is finding the rightmargin, this is why the adaptive margin loss function is chosen as it dependson the inputs. In a batch, all the distances between same patient must besmaller than an adaptive up-margin Mp and all the distance between the dif-ferent patients must be larger than an adaptive down-margin Mn defined inequation (5).

Mp =1

µ(1− exp(−µd)),

Mn =1

γlog(1 + exp(γs))

(5)

Where s is the mean similar distance, d the mean different distance and µ andγ two hyperparameters set respectively to 8 and 2.

From equation (5), Mτ and Mc are defined as Mp = Mτ −Mc and Mn =

Mτ +Mc. Finally the loss function is defined (6) with label ∈ {−1; 1}.

Loss =∑batch

max{Mc − label(Mτ −Dw), 0} (6)

2.5.2.4 Training with Pre-trainingThe training is the same as in part 2.5.2.2, the only difference is in the initial-ization of the weights. In this case of transfer learning (see A.2.3), the weightsfrom the pre-trained network are used in the multi-task network before train-ing following which fine-tuning of the entire network is performed.

3 RESULTS 13

3 Results

The following sections will present the results obtained for the different setsof experiments, starting by the single factor predictor and then moving tomulti-task prediction.

3.1 Single factor prediction

3.1.1 Cancer

Figure 8 represents the loss and accuracy functions for the training and val-idation set. The training loss decreases while the validation one is first un-stable before remaining close to a constant (after 10 epochs). The ROC andprediction matrix in fig 9 show that the network hasn’t learned well the Can-cer prediction as the the ROC is around the random prediction (AUC = 0.51).The prediction matrix shows an inability to classify correctly cancer cases:most of the cases are classified as benign. The unbalanced validation set giveus a quite high validation accuracy: approximately 75%, but this figure mustbe analyzed knowing that the network mainly classifies as benign and themain class is benign.

(a) Loss ResNet 3D (b) Accuracy ResNet 3D

Figure 8: Loss and Accuracy evolution for training and validation phase, Cancerprediction

3.1.2 Emphysema

For the second experiment, the prediction of emphysema, the ROC and pre-diction matrix show that the network is able to learn few aspects of emphy-sema. Indeed the AUC is 0.78, which is much improved than random. Theconfusion matrix (fig 11b) shows that presence and absence of emphysemaare mainly classified accurately. In this case, the validation accuracy of 72%is meaningful as both classes are mostly classified correctly. Moreover, thetraining and validation loss curves (fig 10a are as expected from a network

14 3 RESULTS

(a) ROC Cancer

Prediction No Prediction Yes

Actual No 3470 983Actual Yes 214 139

(b) Confusion matrix Cancer

Figure 9: ROC Curve and Confusion Matrix for Cancer prediction

which has learned something. Then, emphysema can be classified accuratelyby the network from a CT volume.


Figure 10: Loss and Accuracy evolution for training and validation phase,Emphysema prediction

3.1.3 Opacities

Concerning the opacity prediction, the confusion matrix (fig 13b) shows thenetwork is reasonably accurate at recognizing opacities when present but hasmore troubles to deal with the absence of opacities: many more false positivesthan false negatives. The number of false positives is too high to concludesignificance and thus it cannot be assumed that the network has learned.

3.2 Combining factors

Combining factors is first done by training the network from random initial-ization. The ROC curve and histogram distributions for emphysema, cancerand opacities in fig 17 and 18 (left column) show that the network does not

3 RESULTS 15

(a) ROC Emphysema

Pred No Pred Yes


(b) Confusion matrix Emphysema

Figure 11: ROC Curve and Confusion Matrix for Emphysema prediction


Figure 12: Loss and Accuracy evolution for training and validation phase,Opacities prediction

(a) ROC Opacities

Pred No Pred Yes


(b) Confusion matrix Opacities

Figure 13: ROC Curve and Confusion Matrix for Opacities prediction

really learn. There is no separation between classes and the histogram distri-bution look like a Gaussian centered on 0.5 meaning the network has become

16 3 RESULTS

confused for most of the cases.

(a) ROC Siamese

Pred No Pred Yes


(b) Confusion matrix Siamese

Figure 14: ROC Curve and Confusion Matrix for self-supervision

For the self-supervision task, the maximum validation accuracy is 89.25%while the AUC is 0.96. In order to determine if the network classify accuratelytwo patients, the distance between the two output vectors of the network iscomputed (fig 16a). The best threshold is found in fig 16b: 89.25% of accuracyfor a threshold of 0.0448. If the distance is lower than the threshold, imagesare classified as a same pair otherwise as a different pair. The two classes arewell separated on the distance graph 16a

The filters of the first convolution are computed (fig 15) (which corre-sponds to the weights of this layer), they show that the network has learnedsome patterns as shapes in the filters are distinguishable. More important,the filters have changed significantly from the random initialization and thenit can be concluded that the network is now able to recognize specific shapesbecause of these filters.

(a) Random initialization of the filters (b) Filters after training

Figure 15: 32 filters of the first convolutional layer

3 RESULTS 17

(a) Distance distribution for validationset for similar and different patients’

images

(b) Evolution of validation accuracy vsdistance threshold

Figure 16: Siamese network, validation set: distance and threshold

3.2.1 Training with Pre-training

When adding a pre-training to the network, it can be noted that the networkis better at classifying the different diseases, especially for emphysema wherean AUC = 0.78 is reached (fig 17). Moreover, the histograms representing thedistribution of probability of having the disease for each class are not gaus-sian more or less centered in 0.5 anymore. It can be seen for the emphysemathat the two class are well separated (fig 18).

3.2.2 Risk score

Now the final task is to link all these results to a possible clinical application.From the different histogram distributions, a risk score for each disease canbe created. For a given probability of having the disease, the risk score is theproportion of patients having the disease at this probability. The differentrisk scores are displayed in fig 19 and the formula is:

rs(prob = p) =disease(prob)

total(prob)(7)

18 3 RESULTS

(a) ROC Cancer (b) ROC Cancer Pre Trained

(c) ROC Emphysema (d) ROC Emphysema Pre Trained

(e) ROC Opacities (f) ROC Opacities Pre Trained

Figure 17: ROC curves: left column shows without pre-training, right columnshows with pre-training

3 RESULTS 19

(a) Cancer (b) Cancer Pre Trained

(c) Emphysema (d) Emphysema Pre Trained

(e) Opacities (f) Opacities Pre Trained

Figure 18: Distribution of probability of having the disease for each class: leftcolumn shows without pre-training, right column shows with pre-training

20 3 RESULTS

Figure 19: Risk score curve for each disease

4 DISCUSSION 21

4 Discussion

4.1 Main findings

Throughout this thesis it is shown first that there is valuable information in awhole CT scan. Indeed, from the single factor predictor, it can be noted thatit is possible to assess emphysema or opacities better than chance even if thetask is not fully learned. However, in the case of cancer, the network doesnot learn any features which enable it to predict cancer. In the case of emphy-sema, which has been seen to be successful, without having any knowledgeof what is emphysema, the network is able to diagnose emphysema casescorrectly. The task of detecting cancer by training with a full CT-Scan seemsto be too complicated for the network. It can be explained by the fact thatcancer is nowadays detected by nodules, and there is no real pattern insidethe lung helping to predict it. Another reason could be that the used datasetis a high lung cancer risk population as they were all heavy smokers. Theemphysema detection is much better because the task is more visual and apattern clearly exist in the whole lung.The second aim of this thesis is to predict the three diseases with a uniquenetwork, based on the assumption that the features for detecting cancer, em-physema or opacities should be the same. By training from no prior knowl-edge, it can be noted that the results are not comparable in terms of ROC tothe ones obtained with single factor prediction. However, by using the self-supervision as pre-training, the results are improved and ROCs get closer tothe ones obtained by training one by one.

Here two results are important. First, the impact of pre-training a net-work and using a transfer learning approach. In our case, the filters obtainedwith the pre-training fig 15b show pattern as horizontal and vertical slices orblobs which are well known pattern for finding edges, shapes, or texture inan image. The difference of results between non pre-trained and pre-trainednetwork is shown in fig 17 and 18. AUC are increasing for emphysema andopacities and the histogram distributions clearly show the impact of pre-training. In emphysema, it enables the clear separation of classes. Moreover,pre-training a network ensures that it is learning visual features and not onlymachine related features as the resolution or the intensity of the CT-Scan.

The second result is that it is possible to use a multi-task network in orderto learn multiple predictions at the same time. Indeed, whilst using the multi-task network, similar results as the single factor prediction are achieved. Re-sult is slightly better in the case of emphysema: AUC = 0.79 in single predic-tion and AUC = 0.78 in multi-task prediction. This demonstrates the idea thatcombining learning can help a network find the most interesting features andhave better results on some tasks. Whereas, whilst learning only one predic-

22 4 DISCUSSION

tion, the network can become too specific and learn only local features.Finally, from these experiments, risk score for each disease can be calcu-

lated. In order to predict a global risk score, further clinical analysis is re-quired. A simple way could just be to average the different scores in order tohave the global one.

4.2 General impact

Nowadays, images are essential in the detection of lung cancer. However,the use of this images is still empirical as doctors only look at them through adisplay software. Much research exists on the automatic detection of nodulesor classification of nodules but none of them try to assess the informationcontained in the whole Ct-Scan. In this thesis, it was demonstrated despitehaving lots of limitations, that it is possible to find information by only look-ing at an upper scale (global CT-scan), and not a local scale or specific areaor feature. This information could be a great help for doctors in the future aswith further development, it might be possible to automatically give a riskscore for each patient going through a CT-scan exam, and then draw doctorsattention on patients at risk.

The work done in this thesis could also be used in order to curate bigdatabase. For example if for a retrospective study, 2000 emphysema patientsmust be found, by taking the 2000 with the highest probability of havingemphysema, a researcher will save a lot of time comparing to choosing themby hand.

Nowadays, pneumologist use a BROCK score to evaluate the lung cancerrisk [22]. This score includes risk factor such as Emphysema or presence ofnodules. By combining this work, with a local features work on nodulesdetection, it might be possible to calculate the the BROCK score only fromCT-scan.

4.3 Comparison to other methods

The main use of 3D chest CT is to do lung nodules detection as it has beendone in the Kaggle Challenge with the LUNA dataset or the Luna Challenge[34, 9]. They reach score up to 0.951 (FROC) which is a derivative of the ROCcurve. This score is way better than the AUC obtained with the global featureapproach. However, these methods focus on the specific task of nodule de-tection using small patches (with a nodule or just background) in the trainingphase while our method uses the full CT-Scan during the training. Moreover,detection in deep learning is a famous and well-known task with a lot of dif-ferent approaches. These two methods could then be complementary and theoutput of both combine in order to better predict lung cancer.

4 DISCUSSION 23

A study closer to what is done in this thesis is using CheXNet [28]. In thisstudy, they use transfer learning on a 2D network to predict 14 different lungdiseases on 2D X-ray images. Even if some concerns arise about this datasetand this study [24], the presented results are good. They reach an AUC of0.92 for emphysema and 0.78 for nodule. These results can be explained bythe used of a pretrained network: DenseNet, pretrained on ImageNet. In thisthesis, the input are 3D volume and no 3D pretrained network have beenreleased yet.

4.4 Limitations

Several limitations apply to this project and most of them are due to the avail-ability of the data and the consistence of them.

First of all, all the training have been done on the NLST dataset, which hasan inclusion criteria of having more than 30 pack years as smoking history.This of course bias the training of the network and impact its robustness. Thisissue raise the problem of having clinical data for healthy patients. Indeed,these data would considerably improve all the artificial intelligence applica-tions in the medical field. Indeed, in our case, some patients are not diag-nosed of cancer for example, but their lung have many damages due to theirsmoking history, and learning this patient as a healthy one (absence cancer)for the network might confuse it.

The second limitation is the noise of the metadata. Indeed, while analyz-ing the results, a lot of mistakes in the markup of emphysema or opacitieswere found, confusing for sure the network in his learning process. A greatthing is that the emphysema prediction was sometimes better than the hu-man markup. Indeed, for most of the cases with a really high probability ofhaving emphysema whilst the human markup says no emphysema, it is re-alized after checking that the patient indeed has emphysema as shown in fig20.

A more technical limitation is the gpu memory. 3D scan are heavy filesand take a lot of gpu memory. A direct consequence is a small size of batchsize used for the training: 16 comparing to most of the other deep learningtraining which use a batch size of 128. This could influence the training asthe backpropagation has more chance to be sensible to the specificities whenthe batch is small.

4.5 Future work

From this promising project, a lot of new work can derive. First of all, work-ing on the same project but with different dataset would be the first thing todo to validate the work done. Validity of metadata will have to be ensuredwhile collecting them.

24 5 CONCLUSION

Figure 20: False Positive Emphysema

Another idea would be to work on new metadata. Many diseases or infec-tious pattern can appear in the lung and training the network on all of themat the same time would give a more robust network and then better result.

In a more technical approach, the self-supervision training can certainlybe improved by using another task or another loss function as the histogramloss. Testing new networks would also be something interesting or combin-ing networks. For example working on a texture network could be interest-ing.

Finally, it is also possible to apply this idea of working on a full scanneron other part of the body. For example breast cancer or liver cancer couldfind an utility in such a damage score.

5 Conclusion

Different approaches to assess lung damages have been explored in this thesisfor a large dataset of patients.

First, the single predictor approach show that a network is able to learnsome disease as emphysema or opacities with a certain degree of confidenceby only looking at a global scale.

Moreover, the work done here shows that it is possible to predict multi-ple diseases by using only one network, and then that the same features arenecessary to detect different diseases.

This thesis also emphasizes the importance of pre-training. The self-supervisionmethod used here enabled the network to be initialized with visual featuresuseful for the multi-task learning. The importance of transfer learning hasbeen shown as the results are better when pre-training and fine-tuning.

Even if the work as limitations as a bias database, noisy metadata, theoverall show that it is indeed possible to retrieve useful information by look-

5 CONCLUSION 25

ing at a full CT-Scan, and that deep neural network are able to learn at a largescale. The results are encouraging and this work can lead to many other ex-periments on different kind or location of images.

26 REFERENCES

References

[1] I. GLOBOCAN 2012. Data visualization tools that present current nationalestimates of cancer incidence, mortality, and prevalence. 2016. URL: http://gco.iarc.fr/today/online-analysis-multi-%20bars?mode=cancer&mode_population=continents&population=900&sex=0&cancer=29&type=0&%20statistic=0&prevalence=0&color_palette=default.

[2] Y Bengio. “Learning deep architectures for ai. Foundations and TrendsR in Machine Learning, 2 (1): 1–127, 2009”. In: Cited on (), p. 39.

[3] Isabel Bush. Lung nodule detection and classification. Tech. rep. Technicalreport, Stanford Computer Science, 2016.

[4] MEJ Callister et al. “British Thoracic Society guidelines for the investi-gation and management of pulmonary nodules: accredited by NICE”.In: Thorax 70.Suppl 2 (2015), pp. ii1–ii54.

[5] Wanqing Chen et al. “Cancer statistics in China, 2015”. In: CA: a cancerjournal for clinicians 66.2 (2016), pp. 115–132.

[6] Sumit Chopra, Raia Hadsell, and Yann LeCun. “Learning a similaritymetric discriminatively, with application to face verification”. In: Com-puter Vision and Pattern Recognition, 2005. CVPR 2005. IEEE ComputerSociety Conference on. Vol. 1. IEEE. 2005, pp. 539–546.

[7] Ciro Donalek. “Supervised and Unsupervised learning”. In: AstronomyColloquia. USA. 2011.

[8] Jacques Ferlay et al. “Cancer incidence and mortality worldwide: sources,methods and major patterns in GLOBOCAN 2012”. In: Internationaljournal of cancer 136.5 (2015).

[9] grt123. URL: https://github.com/lfz/DSB2017/blob/master/solution-grt123-team.pdf.

[10] Duc M Ha and Peter J Mazzone. “Pulmonary Nodules”. In: Age 30(2014), pp. –05.

[11] Mohammad Havaei et al. “Brain tumor segmentation with deep neuralnetworks”. In: Medical image analysis 35 (2017), pp. 18–31.

[12] Kaiming He et al. “Deep residual learning for image recognition”. In:Proceedings of the IEEE conference on computer vision and pattern recogni-tion. 2016, pp. 770–778.

[13] RT Heelan et al. “Non-small-cell lung cancer: results of the New Yorkscreening program.” In: Radiology 151.2 (1984), pp. 289–293.

REFERENCES 27

[14] National Cancer Institute. Cancer Statistics. URL: https://www.cancer.gov/about-cancer/understanding/statistics.

[15] Shingo Iwano et al. “Computer-aided diagnosis: a shape classificationof pulmonary nodules imaged by high-resolution CT”. In: ComputerizedMedical Imaging and Graphics 29.7 (2005), pp. 565–570.

[16] Michael T Jaklitsch et al. “The American Association for Thoracic Surgeryguidelines for lung cancer screening using low-dose computed tomog-raphy scans for lung cancer survivors and other high-risk groups”. In:The Journal of thoracic and cardiovascular surgery 144.1 (2012), pp. 33–38.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenetclassification with deep convolutional neural networks”. In: Advancesin neural information processing systems. 2012, pp. 1097–1105.

[18] Devinder Kumar, Alexander Wong, and David A Clausi. “Lung noduleclassification using deep features in CT images”. In: Computer and RobotVision (CRV), 2015 12th Conference on. IEEE. 2015, pp. 133–138.

[19] Geert Litjens et al. “A survey on deep learning in medical image anal-ysis”. In: arXiv preprint arXiv:1702.05747 (2017).

[20] Dennis Mackin et al. “Measuring CT scanner variability of radiomicsfeatures”. In: Investigative radiology 50.11 (2015), p. 757.

[21] Heber MacMahon et al. “Guidelines for management of incidental pul-monary nodules detected on CT images: from the Fleischner Society2017”. In: Radiology (2017), p. 161659.

[22] Annette McWilliams et al. “Probability of cancer in pulmonary nod-ules detected on first screening CT”. In: New England Journal of Medicine369.10 (2013), pp. 910–919.

[23] KM Venkat Narayan et al. “Report of a national heart, lung, and bloodinstitute workshop: heterogeneity in cardiometabolic risk in asian amer-icans in the US”. In: Journal of the American College of Cardiology 55.10(2010), pp. 966–973.

[24] Luke Oakden-Rayner. CheXNet: an in-depth review. URL: https://lukeoakdenrayner.wordpress.com/2018/01/24/chexnet-an-in-depth-review/.

[25] Maxime Oquab et al. “Learning and transferring mid-level image rep-resentations using convolutional neural networks”. In: Proceedings ofthe IEEE conference on computer vision and pattern recognition. 2014, pp. 1717–1724.

[26] Sinno Jialin Pan and Qiang Yang. “A survey on transfer learning”. In:IEEE Transactions on knowledge and data engineering 22.10 (2010), pp. 1345–1359.

28 REFERENCES

[27] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: (2017).

[28] Pranav Rajpurkar et al. “CheXNet: Radiologist-Level Pneumonia De-tection on Chest X-Rays with Deep Learning”. In: arXiv preprint arXiv:1711.05225(2017).

[29] Ali Sharif Razavian et al. “CNN features off-the-shelf: an astoundingbaseline for recognition”. In: Proceedings of the IEEE conference on com-puter vision and pattern recognition workshops. 2014, pp. 806–813.

[30] Karen Simonyan and Andrew Zisserman. “Very deep convolutionalnetworks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556(2014).

[31] Standford Univeristy. Transfer Learning. URL: http://cs231n.github.io/transfer-learning/.

[32] Jiayun Wang et al. “Deep ranking model by large adaptive marginlearning for person re-identification”. In: Pattern Recognition 74 (2018),pp. 241–252.

[33] WL Watson and AJ Conte. “Lung cancer and smoking”. In: The Ameri-can Journal of Surgery 89.2 (1955), pp. 447–456.

[34] Julian de Wit. URL: http://juliandewit.github.io/kaggle-ndsb2017/.

[35] Jason Yosinski et al. “How transferable are features in deep neural net-works?” In: Advances in neural information processing systems. 2014, pp. 3320–3328.

[36] Wentao Zhu et al. “DeepLung: 3D deep convolutional nets for auto-mated pulmonary nodule detection and classification”. In: arXiv preprintarXiv:1709.05538 (2017).

A STATE OF THE ART 29

A State of the Art

A.1 Clinical Background

A.1.1 Lung Cancer

According to the Globocan series published in 2012, lung cancer is the mostcommonly diagnosed cancer in the world with around 1.82 million new caseseach year. In 2012, 1.6 million people died from lung cancer which represents19.4% of cancer’s deaths. Lung cancer’s incidence is higher in developedcountries from America, Europe and Asia. [5, 8] Moreover, significant up-ward trends are visible in these countries and especially for Asian females[5]

Lung cancer is mainly related to smoking history [33], but not only as thenumber of lung cancer in Asian female population is increasing while thispopulation has a really small smoking background [5].

Figure 21: Number of incident cases worldwide in 2014 [1]

As every cancer, the lung cancer appeared when abnormal cells grow, inone or both lungs for later forming what is called a tumor. A tumor can bebenign or malignant.

Lung cancer is divided into two types:

• Primary lung cancer: cancer which originates in the lung and can also bedivided into two main types of cancer

– Non-small-cell lung cancer (NSCLC): 80% of the cases and his di-vided into 4 sub-categories (squamous cell, adenocarcinoma, bron-chioalveolar carcinoma, large-cell undifferentiated carcinoma)

– Small-cell lung cancer (SCLC): 20% of the cases and are small cellswhich multiply quickly.

30 A STATE OF THE ART

• Secondary lung cancer: cancer which starts in another part of the bodyand metastasizes to the lung

Lung cancer is mainly discovered from incidental findings (cardiovascu-lar computed tomography scan (CT), liver CT) and from screening programlike in the US [16]. CT images are a series of X-Ray images taken from manydifferent rotations, producing cross-sectional images and computed thanksto a computer. The use of digital geometry processing allows creating 3Dvolume from series of 2D images. This is an expensive technology whichprovides detailed information about body structures and lung structure forchest CT-Scan.

According to the guidelines of this screening programs, a chest low dosecomputed tomography (LDCT) must be performed and from which nodules,benign or malignant, can be revealed and a decision made according to thediagnosis.

A.1.2 Nodules

A pulmonary nodule is a small, round or egg-shaped lesion in the lungswhich results in a radiographic opacity [10]. They are considered to be lessthan 30mm. They are now differentiated into three main categories: solidnodules, part-solid nodules, and pure ground grass nodules [4, 21] (Fig 2).Depending on the category, the guidelines for management of pulmonarymodules vary [4, 21]. Figure 3 summarises the guideline from the FleischnerSociety.

Sub-solid nodules have a higher likelihood of malignancy [22]. However,many factors increasing the risk of malignancy exists and they will be usefulto keep in mind while choosing and training the network.

(a) Solid Nodule (b) Part Solid Nodule (c) Ground Grass Opacity Nodule

Figure 22: Types of nodule


Nodule Type CharacteristicsMalignacy [22]

Benign Malignant

Solid Obscures the underlying bronchiovascular structure. 98.9% 1.1%

Ground GrassOpacification is greater than that of the background

98.1% 1.9%but through which the underlying vascular structure is visible

Part Solid Mix of the two previous types of nodules 93.4% 6.6%

Table 3: Nodule type and malignancy

Figure 23: Fleischner society 2017 Guidelines for Management of IncidentallyPulmonary Nodules in Adults [21]

A.1.3 Risks factors for nodule malignancy

The assessment of nodule malignancy is a true challenge nowadays to per-form better prevention of lung cancer. Indeed, the sooner a nodule is detectedas malignant, the better the treatment will be. Many risk factors for malig-nancy had been reported in the literature, which helps nowadays the doctors


to determine which nodule management to follow:

A.1.3.1 Nodule size [21]The main risk factor is the size of the nodule. Nodule sizes are divided intothree categories: <6mm (<100mm 3 ), 6-8 mm (100 – 250mm 3 ) and > 8mm(> 250mm 3 ). The smaller nodules are more likely to be benign and don’trequire any follow-up in most of the cases whereas the biggest ones require aclose follow-up (3 to 6 months).

A.1.3.2 Nodule Morphology [15]Spiculated nodules are associated with malignancy for many years [21] andare then classified as high-risk nodules.

Figure 24: Different nodules’ morphologies

A.1.3.3 Nodule Location [21]Upper-lobe nodule location is a high-risk factor. [22]

A.1.3.4 Multiplicity [21]High multiplicity is a low-risk factor. The presence of 5 or more noduleslikely results from an infection and then these nodules are benign. Havingbetween 1 and 4 nodules increase the risk of malignancy.

A.1.3.5 Growth Rate [21]The growth rate is estimated by the Volume Doubling Time (VDT) whichcorresponds to the number of days in which the nodule doubles in volume.A VDT < 400 days is a high-risk factor.

A.1.3.6 Age, Sex, Race [21]Lung cancer is really unusual before 40 years old. However, lung cancer in-cidence increases for each added decade. Women are more likely to develop


lung cancer than men and the incidence of lung cancer is much higher inblack population than white population.

A.1.3.7 Tobacco [21]A smoking history increases the risk from 10 to 35 times of having lung cancercompared to non-smokers.

A.1.4 Problem of detection

Pulmonary nodules are now detected by radiologists by considering theirshape, size and brightness of the unknown mass in the lung. Studies haveshown that only 68% of the nodules are found with this visual human detec-tion [13]. The early-classification of the nodules is remaining a challenge inorder to reduce the aggressiveness of the follow-up and treatment of patients.Computer-aided detection (CAD) has then a large role to play in nodule de-tection and is a high topic of interest. [3, 18] Especially the new deep learningarchitectures are promising since their appearance less than 10 years ago andwill be the main topic of my master thesis. Now we will focus more on theengineer approach of the problem by first presenting briefly what is deeplearning.

A.2 Engineering Background

A.2.1 Deep Learning

As described previously, new technologies in medicine have a more and moreimportant role to play. Machine learning, one of these new technologies, isa branch of artificial intelligence in which the system has the ability to learnand improve by itself from experience. In a simple definition, machine learn-ing uses algorithms to parse data, learn and output a prediction of a partic-ular task. Machine learning consists of many sub-categories as decision treelearning, rule-based machine learning or deep learning.


Figure 25: Example of neural network with one hidden layer (in the center)

Deep learning is then a new field of machine learning based on artificialneural network but using deeper architecture (Fig 5). A network is composedof nodes, which are linked together by weights (REFER TO FIG). When send-ing an input, only a few nodes will fire in order to produce an output. Thegoal is to adapt the weights by changing their value in order to get the rightnodes firing. Deep learning was first inspired by the functioning and thestructure of human brain and how the information is delivered from one neu-ron to another. The advantage of deep learning is that each layer producesa certain representation of the input data which are also used as input forthe next level of representation. Then it is possible, passing through a lot ofdifferent layers to combine all these representations in order to perform anykind of tasks depending on the chosen network. For example in the medicalworld, deep learning architectures are now used to perform [2]: image, ob-ject or lesions classification [19], detection, segmentation [11], registration orimage generation and enhancement. Moreover, deep learning is a hot topicin the medical field with an enormous increase in the number of papers pub-lished within the two last year [2]. (REFER TO FIG 6)

A.2.2 How does a deep learning network learn?

A deep learning network must learn by itself and only from the input datathat a human shows it. The goal is to eliminate all the biased knowledge thatwe normally include in the design of our algorithms. A deep learning net-work is then a black box, on which we can only control hyperparameters. Thelearning of a network is separated in three phase: the training, the validationand the testing as the images.


Figure 26: Example of Deep Learning architecture: GoogleNet

Figure 27: 6 a) Nb of papers including machine learning techniques (CNN =Convolutional neural network) b)Nb of paper depending on the task

During the training phase, the weights and bias are updated at each stepin order to reach the best model. The learning can be supervised or unsuper-vised [7]. Supervised learning includes labels, which represent the desiredoutput for an input. Thus, every time we show a new input to the network(a CT image in our case), we also provide it which output it should return(benign or malignant in our case) and then, thanks to specific optimizer, theweights and bias are automatically updated to reach the best performance.For example, in the case of nodules classification, every time we show a newnodule to the network, we have to provide it the desired output which willbe the type of this nodule (solid, sub-solid, ground grass). In unsupervisedlearning, no labels are provided, and then the network makes its own deci-sion about how to classify the data.In the validation phase, the goal is to assess the performance of the networkby showing it input data that it has never seen before. We can then assesshow well the network behaves with new and unknown data. It allows engi-neers to find the best model.


The testing phase is the last phase, once the best model had been chosenthrough the validation phase, to evaluate the general performance of themodel.

A.2.3 Transfer learning

As we said, deep learning enables us to work on tasks which are computedfrom the database rather than imposed from a model or selected by a human.Such techniques offer the advantage of selecting optimal features for the taskand enable a higher number of degrees of freedom for the classifier than amodel would ever do, but the training of such systems and the managementof a large number of degrees of freedom becomes a challenge.Then, transfer learning is a new method more and more used in the field. Themain idea is that the features learned from one system are used and adaptedto another system. It enables better convergence of the system for complextasks or tasks where the number of available data is too low [26, 25, 29, 35].For example, it is possible to use the features of a network which had learnedto classify goldfish, giant schnauzer, tiger cat, ... (AlexNet trained with Im-agenet [17]) in order to classify medical images, as many of the features areshared by every kind of images (for instance edges). Using transfer learn-ing enables to save a lot of computational time comparing to train a networkfrom no prior knowledge and initialized with random weights.Several methods exist in order to adapt these existing models to a specificapplication [31]

A.2.3.1 Feature ExtractorIt consists of removing the last fully-connected layer of a pre-trained network(CITE FIGURE). Pre-trained network means that we keep the weights learnedfrom a previous training made with a set of general images. The remainingpart of the network is then considered as a feature extractor. The last fully-connected layer is replaced by a linear classifier and train with the specific setof images. The network learns how to classify specific images (for instancesmedical images) based on features learned from general images.

A.2.3.2 Fine-tuningThe second approach consists in replacing a larger part of the existing andpre-trained network and to fine-tune the weights (CITE FIGURE). This isdone with backpropagation on the number of replaced layers. It is possibleto fine-tune the entire network. It will then take more computational time butwill remain shorter than training from no prior knowledge as the weights arenot distributed randomly. This method is based on the principle that a net-work becomes more and more specific with the layers, then if we retrained a


sufficient number of layer, it is possible to erase this specificity learned froma previous dataset and train it to the new dataset.

A.2.4 Challenges with deep learning

Deep learning is a promising and powerful tool, but the use and comprehen-sion of it remain tricky. An engineer has to face many challenges in orderto get results from these networks. The following challenges are the mainchallenges that I will have to face during my master thesis, and that everyengineer need to think about while working with deep learning.

A.2.4.1 Architecture selection in transfer learningSince the revolution of deep learning in 2012 with the creation of AlexNetnetwork which won the ImageNet competition (classification challenge over1000 classes), numerous and various network architectures have been cre-ated. Each model has its own specificities and thus advantages and draw-backs. A good comprehension of them enables a user to choose the right onedepending on the achieved task. Some comparison had been done on thesame dataset as seen in figure 8. Here are some of the well-known architec-tures broadly used:


Figure 28: a) AlexNet network b)Alexnet network with feature extractor c)Alexnet network with fine tuning of the three last layers


Figure 29: Top 1 accuracy vs operations, size & parameters

Alexnet [17]AlexNet is the network which had changed the vision and use of deep neuralnetwork by being the first one with such a large network. The network isused for classification of images and from a 256*256 image give a probabilityto belong to one of the 1000 classes. It uses large convolution in order toextract the spatial features from the images.

The main breakthrough of Alexnet is the use of GPU for the first time toperform the training which reduces it consequently.

VGG [30]The main difference between VGG developed in Oxford, and AlexNet is theuse of a series of smaller spatial convolution 3*3. The number of parametersand thus the power of the network increase a lot but as the computation time.

GoogleNetGoogleNet is a more recent network based on the concept of inception. An in-ception module if a parallel combination of different operations (convolutionfor example) done with a smaller amount of parameters. It has been shownthat parallelizing these operations lead to equivalent results with a reducedtime of computation.


Figure 30: Inception module

Resnet [12]Finally, one of the most famous networks is ResNet. Its differentiation isbased on the idea of that one output should feed the input of not only onebut two successive layers.

Figure 31: One output feeds two different inputs

The use of the network is then important and challenging, but the rightnetwork without the right data will never learn the desired output.

A.2.4.2 Number and quality of data: pre-processingIn deep learning and more generally in machine learning, the number of dataand their quality have a high importance on the performance of a network.Indeed, as the human brain, the more data the network will see the moreexperienced it will be on this task and then the more accurate it will be on aparticular task.

The number of data is then a key point, especially in the medical fieldwhere it is challenging to collect a large amount of data. Pre processing isa key step in the success of a network on a larger scale. Indeed, CT scansare performed following different protocols depending on the machine, thehospitals and also the user. All these differences result in differences in theimage: resolution, centring of the region of interest, different level of con-trast. But the network is trained on one set of images and the goal is to use


it for every CT scan in the future. Standard pre-processing steps include nor-malization (to show the network the same kind of images), data augmenta-tion (rotation, rescaling, flipping ...) and elimination of outliers. Then, onesends the same kind of images (same size, mean value ...) which enables thenetwork to perform better. The data augmentation is used to increase thevariability of training images.

Moreover, it is important to use balanced data while performing training,validation, and testing. In the case of classification, balanced data meanshaving around the same amount of data for each class.

A.2.4.3 Time-processing (CPU and GPU)Training time is a key aspect in the development of deep learning network.A model can take more than one week to be trained and so, it is necessary touse the right tools in order to accelerate these process. In order to get the bestresults, multiple experiments need to be done for each network.

Central Processing Unit (CPU) is found on each computer whereas Graph-ics Processing Unit (GPU) is added in order to accelerate the computation.Numerous differences exist between CPU and GPU, but in a nutshell, GPUallows much faster computation while CPU is easier to program. Nowadays,all the deep learning packages include the GPU integration which enablesone user to program efficiently his network on GPU and so to save a lot ofcomputational time.

A.2.4.4 OverfittingThe main challenge of training neural network with a lot of data is to tradewith the bias-variance dilemma. Bias represents the error in the result fromwrong assumptions in the algorithm. Variance is the error from fluctuationsin the training set.

As shown on CITE FIGURE, it is impossible to have both a low bias andvariance, and then it is necessary to find a compromise.

In order to apply the model for general use, one must avoid overfitting.We call overfitting when the model learn too many particularities from thetraining data (for example a specific noise due to the CT acquisition at thehospital) and thus doesn’t perform well on unseen data from another dataset.

One easy way to track overfitting is to track the validation error (or vali-dation loss), as soon as it increases again, it means that what the network justlearned is specific to the training data and cannot be generalized.

Once all these challenges have been taken into account, the performanceof a model has to be evaluated thanks to some evaluators in order to be ableto compare different model and results.


Figure 32: Choice of complexity depending on error

A.2.4.5 Performance evaluationROC and AUCMany ways exist in order to evaluate the performance of a network. As wesaw in the previous part, one needs to follow the error in order to stop thetraining at the right moment.

Another easy way to assess the performance of a network is the accuracy.The accuracy is mainly assessed in machine learning by the area under thecurve of a ROC (Receiver operating characteristic) for a binary classifier. [29]This is the plotting of TPR (true positive rate) against the FPR (false posi-tive rate). Then comparing these curves from different models give a goodcomparison of the performance of these models.

Figure 33: Example of ROC curve


Prediction No Prediction Yes

Actual No True Negatives (TN) False Positives (FP)Actual Yes False Negatives (FN) True Positives (TP)

Table 4: Confusion Matrix

Confusion MatrixThe confusion matrix is another way to display the performance of a classi-fier. It is displayed as a table summarizing the true and false positive andnegative. The matrix as the following form for a binary classifier:

The advantage of a confusion matrix is all the rates that we can computefrom it:Accuracy: correctness of the classifier

accuracy =TP + TN

Total(8)

Error rate: inaccuracy of the classifier

errorrate =FP + FN

Total(9)

Specificity: when the reality is no, how often the classifier does predict no:

specificity =TN

TN + FP(10)

True positive rate =TP

TP + FN(11)

False positive rate =FP

TN + FP(12)

Sensitivity or Precision: when the reality is yes, how often the classifier doespredict yes:

precision =TP

TP + FN(13)

All these values enable us to better understand how the network worksand what it’s able to learn. For example, a high sensitivity and low specificitymean that the network is great as finding the desired object (high sensitivity)but also that it thinks that many non-objects are objects (low specificity).

All these performance tools will be used on my project in order to evaluatemy network and compare different approaches.

TRITA TRITA-CBH-GRU-2018:15

www.kth.se

Documents

Assessment of lung damages from CT images using machine