
Ensemble of Deep Convolutional Neural Networks for Learning to Detect Retinal Vessels in Fundus Images

Debapriya Maji∗1, Anirban Santara∗2, Pabitra Mitra2, Debdoot Sheet2

Abstract— Vision impairment due to pathological damage of the retina can largely be prevented through periodic screening using fundus color imaging. However, the challenge with large scale screening is the inability to exhaustively detect fine blood vessels crucial to disease diagnosis. In this work we present a computational imaging framework using deep and ensemble learning for reliable detection of blood vessels in fundus color images. An ensemble of deep convolutional neural networks is trained to segment vessel and non-vessel areas of a color fundus image. During inference, the responses of the individual ConvNets of the ensemble are averaged to form the final segmentation. In experimental evaluation with the DRIVE database, we achieve the objective of vessel detection with maximum average accuracy of 94.7% and area under ROC curve of 0.9283.

Index Terms— Computational imaging, deep learning, convolutional neural network, ensemble learning, vessel detection.

I. INTRODUCTION

Examining pathological conditions of the retina through regular screening [1], [2] can greatly assist the prevention of blindness. Fundus imaging is the most widely used modality for early screening and detection of blindness-causing diseases such as diabetic retinopathy, glaucoma, age-related macular degeneration [3], and hypertension- and stroke-induced changes [4]. Imaging of the fundus has largely improved with the progress from film based photography to electronic imaging sensors, as well as red-free imaging, stereo photography, hyperspectral imaging, angiography, etc. [5], thereby reducing inter- and intra-observer reporting variability. Retinal image analysis has also significantly contributed to this technological development [5], [6]. Since fundus imaging is predominantly used for the first level of abnormality screening, research focus includes: (i) detection and segmentation of retinal structures (vessels, fovea, optic disc), (ii) segmentation of abnormalities, and (iii) quality quantification of acquired images to assess reporting fitness [5].

Related Work: The process of clinical reporting of retinal abnormalities is systematic, and lesions are reported with respect to their location from vessels or the optic disc. Computer assisted diagnosis systems are accordingly being developed to improve the clinical workflow [5]. Some of the developments include assessment of image quality [8]; blood vessel detection [7]; branch pattern, diameter and vascular tree analysis [3], [4], [9], [10]; followed by reporting of lesions and their location with respect to the vessels [4], [10]. An important challenge in this context is robust and exhaustive detection of retinal vessels in color fundus images, so as to leverage the potential of computer assisted diagnosis [5], [11] and assist in routine screening [3], [4].

1 D. Maji is with Texas Instruments Inc.

2 A. Santara, P. Mitra and D. Sheet are with the Indian Institute of Technology Kharagpur, Kharagpur, WB 721302, India. anirban [email protected]

∗ equal contribution

Fig. 1. Retinal vessel detection using our proposed ensemble of ConvNets. Sample #16 in [7]. Panels: (a) fundus image, (b) ground truth, (c) detected vessels, (d) proposed methodology.

Challenge: Methods for vessel detection and segmentation [5]–[7], [9], [10], [12] predominantly use image filters, vector geometry, statistical distribution studies, and machine learning of low-level features and photon distribution models. Such methods rely on handcrafted features or heuristic assumptions and are not generalized to learn pattern attributes from the data itself, which makes their performance subject to the inherent weaknesses of each method. Recently, fully data-driven, deep learning based models have been proposed [13]. However, they are weaker in performance compared to the state of the art methods that use the conventional paradigm. The primary challenge here is to design an end-to-end framework that learns pattern representation from the data, without any heuristic information based on domain knowledge, to identify both coarse and fine vascular structures, and that is at least on par with, if not better than, the heuristic-based models.

Approach: This paper attempts to ameliorate the issue of subjectivity induced bias in feature representation by training an ensemble of 12 Convolutional Neural Networks (ConvNets) [14] on raw color fundus images to discriminate vessel pixels from non-vessel ones. Fig. 1 illustrates an example of exhaustive retinal vessel detection using this approach. Each ConvNet has three convolutional layers and two fully connected layers and is trained independently on randomly selected patches from the training images. At the time of inference, the vesselness probabilities independently output by each ConvNet are averaged to form the final vesselness probability of each pixel.

§II gives a brief theoretical background of the proposed method. The problem statement is formally defined in §III and the proposed approach is described in §IV. The results of experimental evaluation on the DRIVE dataset are presented in §V. The paper concludes in §VI with a summary of the proposed method and a discussion of the possible impact of end-to-end deep learning based solutions for medical image analysis.

arXiv:1603.04833v1 [cs.LG] 15 Mar 2016

II. THEORETICAL BACKGROUND

This section introduces some concepts regarding ConvNets and ensemble learning which form the pillars of the proposed solution.

Convolutional Neural Networks: Convolutional neural networks (CNN or ConvNet) are a special category of artificial neural networks designed for processing data with a grid-like structure [14], [15]. The ConvNet architecture is based on sparse interactions and parameter sharing and is highly effective for efficient learning of spatial invariances in images [16], [17]. There are four kinds of layers in a typical ConvNet architecture: convolutional (conv), pooling (pool), fully-connected (affine) and rectified linear unit (ReLU). Each convolutional layer transforms one set of feature maps into another set of feature maps by convolution with a set of filters. Mathematically, if $W_i^l$ and $b_i^l$ denote the weights and the bias of the $i$th filter of the $l$th convolutional layer and $H_i^l$ is its activation map, then:

$$H_i^l = H^{l-1} \otimes W_i^l + b_i^l \tag{1}$$

where $\otimes$ is the convolution operator. Pooling layers perform a spatial downsampling of the input feature maps. Pooling helps to make the representation invariant to small translations of the input. Fully-connected layers are similar to the layers in a vanilla neural network. Let $W^l$ denote the incoming weight matrix and $b^l$ the bias vector of a fully-connected layer $l$. Then:

$$H^l = \mathrm{flatten}(H^{l-1}) \ast W^l \oplus b^l \tag{2}$$

where the $\mathrm{flatten}(\cdot)$ operator tiles the feature maps of the input volume along the height, $\ast$ is matrix multiplication and $\oplus$ is element-wise addition. ReLU layers perform a pointwise rectification of the input and correspond to the activation function. For the $i$th unit of layer $l$:

$$H_i^l = \max(H_i^{l-1}, 0) \tag{3}$$

In a deep ConvNet, units in the deeper layers indirectly interact with a larger area of the input, thus forming a high level abstraction of the input data.
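The three layer types of Eqs. (1)-(3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the unpadded stride-1 convolution, the random weights and the two-unit affine output are all choices made here for clarity.

```python
import numpy as np

def conv_layer(h_prev, weights, biases):
    """Eq. (1): unpadded, stride-1 convolution with a bank of filters.

    h_prev:  (C, V, H) input feature maps
    weights: (N, C, F, F) one filter per output feature map
    biases:  (N,) one bias per filter
    """
    n_filters, _, f, _ = weights.shape
    _, v, h = h_prev.shape
    out = np.zeros((n_filters, v - f + 1, h - f + 1))
    # Flip the kernels spatially so this is a convolution, not a correlation.
    flipped = weights[:, :, ::-1, ::-1]
    for r in range(out.shape[1]):
        for c in range(out.shape[2]):
            window = h_prev[:, r:r + f, c:c + f]
            # Sum over channels and both spatial axes for every filter at once.
            out[:, r, c] = np.tensordot(flipped, window, axes=3) + biases
    return out

def relu(h_prev):
    """Eq. (3): pointwise rectification."""
    return np.maximum(h_prev, 0)

def affine_layer(h_prev, weights, bias):
    """Eq. (2): flatten the input volume, multiply by W, add b."""
    return h_prev.reshape(-1) @ weights + bias

# Illustrative forward pass on a random 3 x 31 x 31 patch (weights are random).
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 31, 31))
w = rng.standard_normal((64, 3, 4, 4)) * 0.01
b = np.zeros(64)
a = relu(conv_layer(x, w, b))
scores = affine_layer(a, rng.standard_normal((a.size, 2)) * 0.01, np.zeros(2))
```

Because no zero padding is applied in this sketch, the 4 × 4 filters shrink the 31 × 31 patch to 28 × 28; a practical implementation would also replace the explicit loops with a vectorized or library convolution.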

Ensemble learning: Ensemble learning is a technique of using multiple models or experts for solving a particular artificial intelligence problem [18]. Ensemble methods seek to promote diversity among the models they combine and reduce the problem of overfitting the training data. The outputs of the individual models of the ensemble are combined (e.g. by averaging) to form the final prediction. Concretely, if $\{m_1, \ldots, m_k\}$ are the $k$ models of an ensemble and $p(x = y_i \mid m_j)$ is the probability that the input $x$ is classified as $y_i$ under the model $m_j$, then the ensemble predicts:

$$p(x = y_i \mid m_1, \ldots, m_k) = \frac{1}{k} \sum_{j=1}^{k} p(x = y_i \mid m_j) \tag{4}$$

Ensemble learning promotes better generalization and often provides higher accuracy of prediction than the individual models.
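The averaging rule of Eq. (4) amounts to a mean over the model axis. The sketch below uses three hypothetical models and two pixels with made-up probabilities; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def ensemble_average(member_probs):
    """Eq. (4): average class probabilities over the k ensemble members.

    member_probs: (k, n_pixels, n_classes); each row is a probability vector.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    return member_probs.mean(axis=0)

# Three hypothetical models, two pixels, classes ordered (vessel, non-vessel).
probs = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.8, 0.2], [0.6, 0.4]],
])
avg = ensemble_average(probs)      # averaged posterior per pixel
labels = avg.argmax(axis=1)        # index of the most probable class
```

Here the second pixel is labelled non-vessel even though one of the three models votes for vessel, which shows how averaging smooths out the disagreement of individual members.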

III. PROBLEM STATEMENT

Let $I$ be an image acquired by the RGB sensor of a color fundus camera. The intensity observed at location $x$ is denoted by $g(x)$. Let $N(x)$ be a set of pixels in the local neighborhood of $x$. Let $\omega = \{\text{vessel}, \text{non-vascular}\}$ be the set of class labels for the pixel at location $x$. In a machine learning framework, the probability of finding a tissue of type $\omega$ at location $x$, $p(\omega \mid I, x)$, is modelled by a class of functions $H(\omega \mid I, N(x), x, \theta_1, \theta_2)$, where $\theta_1$ is a set of parameters learned from the training data and $\theta_2$ is a set of hyperparameters tuned using the validation data. In the proposed method, $H(\cdot)$ is an ensemble of ConvNets, the architecture of which is described next.
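As an illustration of the neighborhood $N(x)$, the sketch below cuts a square patch centred on a pixel. The default size of 31 matches the patch size given in §IV; the function name and the reflection padding at image borders are assumptions of this sketch, since the text does not say how border pixels are handled.

```python
import numpy as np

def extract_patch(image, row, col, size=31):
    """Return the size x size neighbourhood N(x) centred on pixel (row, col).

    image: (3, V, H) colour image. Borders are handled by reflection padding,
    which is an assumption of this sketch.
    """
    half = size // 2
    padded = np.pad(image, ((0, 0), (half, half), (half, half)), mode="reflect")
    # After padding, the original pixel (row, col) sits at (row + half, col + half),
    # so the window starting at (row, col) is centred on it.
    return padded[:, row:row + size, col:col + size]

# Toy image with distinct values so the centring can be checked.
image = np.arange(3 * 40 * 50).reshape(3, 40, 50)
patch = extract_patch(image, 20, 25)
corner = extract_patch(image, 0, 0)
```

Classifying every pixel of an image then reduces to calling the classifier on one such patch per location.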

IV. PROPOSED SOLUTION

Each layer of a ConvNet transforms one volume of features into another. A volume of features is described as N × V × H, where N is the number of feature maps of spatial dimension V × H. The input to each ConvNet of the proposed ensemble is a 3 × 31 × 31 color fundus image patch. The ConvNets have the same organization of layers, which can be described as: input - [conv - relu] - [conv - relu - pool] × 2 - affine - relu - [affine with dropout] - softmax. Fig. 2 gives a schematic diagram of the organization. Each conv layer has a receptive field of size 4 × 4, a stride of 1 and an output volume of 64 × 32 × 32. The pool layers have a receptive field of size 2 × 2 and a stride of 2. Dropout [19] is a regularization method for neural networks that enforces sparsity, prevents co-adaptation of features and promotes better generalization by forcing a fraction of neurons to be inactive during each episode of learning. The output of the final layer is passed to a softmax function which converts the outputs into class probabilities. Let $H_i^{out}$ denote the activation of the $i$th neuron of the fully-connected output layer and $P(o_i)$ denote the posterior probability of the $i$th output class. Then:

$$P(o_i) = \mathrm{softmax}(H^{out})_i = \frac{e^{H_i^{out}}}{\sum_j e^{H_j^{out}}} \tag{5}$$
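Eq. (5) can be evaluated as in the sketch below. Subtracting the maximum activation before exponentiating is a standard numerical safeguard added here; it leaves the result of Eq. (5) unchanged but is not something the text describes.

```python
import numpy as np

def softmax(h_out):
    """Eq. (5): convert output-layer activations into class probabilities."""
    # Shifting by the maximum avoids overflow and does not change the ratio.
    shifted = h_out - np.max(h_out, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Two output units (vessel, non-vessel) with illustrative activations.
p = softmax(np.array([2.0, 0.0]))
```

For two classes the first probability reduces to the logistic function of the activation difference, here 1 / (1 + e^(-2)).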

Fig. 2. Organization of layers in the ConvNets of the proposed ensemble. S: stride, P: zero padding, n: number of hidden units in the penultimate affine layer, which varies across ConvNets (see §V for details).
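The spatial sizes quoted for the layers can be checked with the usual output-size relation, (input + 2 × padding - field) / stride + 1. The sketch below assumes a zero padding of 2 pixels per side for the conv layer; this value is not stated in the text and is chosen here because it maps a 31 × 31 input to the reported 32 × 32 output with a 4 × 4 field and stride 1.

```python
def conv_output_size(in_size, field, stride=1, padding=0):
    """Spatial output size of a conv or pool layer along one dimension."""
    span = in_size + 2 * padding - field
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input")
    return span // stride + 1

# First conv layer: 31-pixel input, 4 x 4 field, stride 1, assumed padding 2.
conv_out = conv_output_size(31, field=4, stride=1, padding=2)
# Pool layer: 2 x 2 field, stride 2, no padding.
pool_out = conv_output_size(conv_out, field=2, stride=2)
```

Under these assumptions the conv layer yields 32 pixels per side and a 2 × 2, stride-2 pool applied to that output halves it to 16.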

V. EXPERIMENTAL RESULTS AND DISCUSSION

This section presents experimental validation of the proposed technique and its performance comparison with earlier methods [11].

Dataset: The ensemble of ConvNets is evaluated by learning with the DRIVE training set (image id. 21-40) and testing over the DRIVE test set (image id. 1-20)¹.

Learning mechanism: Each ConvNet is trained independently on a set of 60000 randomly chosen 3 × 31 × 31 patches. Learning rate and annealing rate were kept constant across models at 5e-4 and 0.95 respectively. Dropout probability, L2 regularization coefficient and number of hidden units in the penultimate affine layer of the different models were sampled respectively from U([0.5, 0.9]), U([1e-3, 2.5e-3]) and U({128, 256, 512}), where U(.) denotes the uniform probability distribution over a given range. The models were trained using the RMSProp algorithm [20] with a minibatch size of 200.
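The per-model hyperparameter draw described above can be sketched as follows. The dictionary keys, the function name and the random seed are illustrative assumptions; the ranges, the fixed learning and annealing rates, the minibatch size and the ensemble size of 12 come from the text.

```python
import numpy as np

def sample_hyperparameters(rng):
    """Draw one ConvNet configuration from the stated distributions."""
    return {
        "learning_rate": 5e-4,                              # fixed across models
        "annealing_rate": 0.95,                             # fixed across models
        "dropout": rng.uniform(0.5, 0.9),                   # U([0.5, 0.9])
        "l2": rng.uniform(1e-3, 2.5e-3),                    # U([1e-3, 2.5e-3])
        "hidden_units": int(rng.choice([128, 256, 512])),   # U({128, 256, 512})
        "batch_size": 200,
    }

rng = np.random.default_rng(0)   # seed chosen only for reproducibility here
configs = [sample_hyperparameters(rng) for _ in range(12)]
```

Drawing a separate configuration for each member is one way to encourage the diversity that ensemble averaging relies on.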

Performance assessment: Table I presents the accuracy and consistency of detection in comparison with those reported for earlier techniques. Although our approach does not achieve the highest accuracy among the compared methods, it outperforms the previously proposed deep learning based method for learning vessel representations from data [13]. The kappa score, a measure of observer consistency, indicates the sensitivity of the technique in detecting both coarse and fine vessels, as desired. Typical responses for detection of both coarse and fine vessels are presented in Fig. 3.
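For readers who want to reproduce the two columns of Table I on their own segmentations, the sketch below computes pixel accuracy and a kappa score from a binary ground truth and a binary prediction. It assumes the kappa in Table I is Cohen's kappa, which the text does not state explicitly, and the toy labels are made up.

```python
import numpy as np

def accuracy_and_kappa(truth, pred):
    """Pixel accuracy and Cohen's kappa for two binary vessel masks."""
    truth = np.asarray(truth).astype(bool).ravel()
    pred = np.asarray(pred).astype(bool).ravel()
    observed = np.mean(truth == pred)                  # observed agreement
    # Agreement expected by chance from the marginal vessel frequencies.
    expected = (truth.mean() * pred.mean()
                + (1 - truth.mean()) * (1 - pred.mean()))
    kappa = (observed - expected) / (1 - expected)     # undefined if expected == 1
    return observed, kappa

# Ten toy pixels: 1 = vessel, 0 = non-vessel.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
acc, kappa = accuracy_and_kappa(truth, pred)
```

In this toy case the accuracy is 0.8 while kappa is about 0.52, which illustrates why kappa is the more informative score when non-vessel pixels dominate the image.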

Fig. 4 gives the receiver operating characteristic (ROC) curve of the proposed method. The area under the ROC curve obtained is 0.9283.
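An ROC curve and its area can be obtained from per-pixel vesselness probabilities by sweeping the decision threshold, as in the sketch below. The function name and toy scores are illustrative, and the sketch assumes no two pixels share exactly the same score; it is not the evaluation code behind Fig. 4.

```python
import numpy as np

def roc_curve_and_auc(truth, scores):
    """ROC points and trapezoidal AUC from binary labels and scores.

    Assumes scores are distinct; tied scores would need to be grouped.
    """
    truth = np.asarray(truth).astype(bool).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    order = np.argsort(-scores, kind="stable")   # highest score first
    truth = truth[order]
    # Lowering the threshold past each pixel adds one true or false positive.
    tpr = np.concatenate([[0.0], np.cumsum(truth) / truth.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~truth) / (~truth).sum()])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    return fpr, tpr, auc

# Six toy pixels: three vessel, three non-vessel.
truth = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
fpr, tpr, auc = roc_curve_and_auc(truth, scores)
```

The area equals the fraction of (vessel, non-vessel) pixel pairs that the scores rank correctly, which is 8 of 9 pairs in this toy example.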

VI. CONCLUSION

This paper presents a ConvNet-ensemble based framework for processing color fundus images for detection of coarse and fine vessels. The method is evaluated experimentally on the DRIVE dataset. The remarkable ability of ConvNets to recognize images and that of ensemble learning to generalize are leveraged to design a heuristics-independent, data-driven approach to analyzing medical images. This presents a feasible solution to subjectivity induced bias in medical image analysis. This is an improvement of our previous work on data-driven analysis of fundus images [13]. More generally, this approach provides a strong alternative for solving complex medical data analysis problems through deep learning combined with the power of ensemble learning.

¹ DRIVE dataset: http://www.isi.uu.nl/Research/Databases/DRIVE/

Fig. 3. Detection of (a-d) coarse vessels in Sample #5 and both (e-h) coarse and (i-l) fine vessels in Sample #16 in [7]. Each row shows the image, a magnified view, the ground truth and the detected vessels.

TABLE I
PERFORMANCE COMPARISON OF DIFFERENT ALGORITHMS.

Method                   Max. avg. accuracy   Kappa
Proposed method          0.9470               0.7031
Maji et al. [13]         0.9327               0.6287
Sheet et al. [12]        0.9766               0.8213
Second observer          0.9473               0.7589
Staal et al. [7]         0.9422               -
Niemeijer et al.         0.9416               0.7145
Zana et al.              0.9377               0.6971
Jiang et al.             0.9212               0.6399
Martínez-Pérez et al.    0.9181               0.6389
Chaudhuri et al.         0.8773               0.3357

Fig. 4. Receiver Operating Characteristic (ROC) curve.

REFERENCES

[1] A. Tuulonen, P. J. Airaksinen, A. Montagna, and H. Nieminen, “Screening for glaucoma with a non-mydriatic fundus camera,” Acta Ophthalmol., vol. 68, no. 4, pp. 445–449, 1990.

[2] E. Stefánsson, T. Bek, M. Porta, N. Larsen, J. K. Kristinsson, and E. Agardh, “Screening and prevention of diabetic blindness,” Acta Ophthalmol., vol. 78, no. 4, pp. 374–385, 2000.

[3] C. Heneghan, J. Flynn, M. O'Keefe, and M. Cahill, “Characterization of changes in blood vessel width and tortuosity in retinopathy of prematurity using image analysis,” Medical Image Analysis, vol. 6, no. 4, pp. 407–429, 2002.

[4] T.-Y. Wong, R. Klein, B. E. K. Klein, J. M. Tielsch, L. Hubbard, and F. J. Nieto, “Retinal microvascular abnormalities and their relationship with hypertension, cardiovascular disease, and mortality,” Survey Ophthalmol., vol. 46, no. 1, pp. 59–80, 2001.

[5] M. D. Abramoff, M. K. Garvin, and M. Sonka, “Retinal imaging and image analysis,” IEEE Rev. Biomed. Engg., vol. 3, pp. 169–208, 2010.

[6] N. Patton, T. M. Aslam, T. MacGillivray, I. J. Deary, B. Dhillon, R. H. Eikelboom, K. Yogesan, and I. J. Constable, “Retinal image analysis: Concepts, applications and potential,” Progress in Retinal and Eye Research, vol. 25, no. 1, pp. 99–127, 2006.

[7] J. Staal, M. D. Abramoff, M. Niemeijer, M. A. Viergever, and B. van Ginneken, “Ridge-based vessel segmentation in color images of the retina,” IEEE Trans. Med. Imaging, vol. 23, no. 4, pp. 501–509, Apr. 2004.

[8] M. E. Gegundez-Arias, A. Aquino, J. M. Bravo, and D. Marin, “A function for quality evaluation of retinal vessel segmentations,” IEEE Trans. Med. Imaging, vol. 31, no. 2, pp. 231–239, Feb. 2012.

[9] M. Sofka and C. V. Stewart, “Retinal vessel centerline extraction using multiscale matched filters, confidence and edge measures,” IEEE Trans. Med. Imaging, vol. 25, no. 12, pp. 1531–1546, Dec. 2006.

[10] J. Jan, J. Odstrcilik, J. Gazarek, and R. Kolar, “Retinal image analysis aimed at blood vessel tree segmentation and early detection of neural-layer deterioration,” Comput. Med. Imaging and Graphics, vol. 36, no. 6, pp. 431–441, 2012.

[11] M. Niemeijer, J. Staal, B. van Ginneken, M. Loog, and M. D. Abramoff, “Comparative study of retinal vessel segmentation methods on a new publicly available database,” in SPIE Medical Imaging. Proc. SPIE, 2004, vol. 5370, pp. 648–656.

[12] D. Sheet, S. P. K. Karri, S. Conjeti, S. Ghosh, J. Chatterjee, and A. K. Ray, “Detection of retinal vessels in fundus images through transfer learning of tissue specific photon interaction statistical physics,” in Proc. Int. Symp. Biomed. Imaging, 2013, pp. 1452–1456.

[13] D. Maji, A. Santara, S. Ghosh, D. Sheet, and P. Mitra, “Deep neural network and random forest hybrid architecture for learning to detect retinal vessels in fundus images,” in Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, 2015, pp. 3029–3032.

[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, pp. 2278–2324, 1998.

[15] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning,” Book in preparation for MIT Press, 2016.

[16] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.

[17] S. Mallat, “Understanding deep convolutional networks,” arXiv:1601.04920, 2016.

[18] T. G. Dietterich, “Ensemble methods in machine learning,” LNCS, vol. 1857, pp. 1–15, 2001.

[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.

[20] G. Hinton, N. Srivastava, and K. Swersky, “Overview of minibatch gradient descent.”
