



Quantitative analysis of patch-based fully convolutional neural networks for tissue segmentation on brain magnetic resonance imaging

Jose Bernal∗, Kaisar Kushibar, Mariano Cabezas, Sergi Valverde, Arnau Oliver, Xavier Lladó

Computer Vision and Robotics Institute, Dept. of Computer Architecture and Technology,
University of Girona, Ed. P-IV, Av. Lluís Santaló s/n, 17003 Girona (Spain)

Abstract

Accurate brain tissue segmentation in Magnetic Resonance Imaging (MRI) has attracted the attention of medical doctors and researchers since variations in tissue volume help in diagnosing and monitoring neurological diseases. Several proposals have been designed throughout the years, comprising conventional machine learning strategies as well as convolutional neural network (CNN) approaches. In particular, in this paper we analyse a sub-group of deep learning methods producing dense predictions. This branch, referred to in the literature as Fully CNN (FCNN), is of interest as these architectures can process an input volume in less time than CNNs and local spatial dependencies may be encoded since several voxels are classified at once. Our study focuses on understanding the architectural strengths and weaknesses of literature-like approaches. Hence, we implement eight FCNN architectures inspired by robust state-of-the-art methods on brain segmentation related tasks. We evaluate them using the IBSR18, MICCAI2012 and iSeg2017 datasets as they contain infant and adult data and exhibit varied voxel spacing, image quality, number of scans and available imaging modalities. The discussion is driven in three directions: comparison between 2D and 3D approaches, the importance of multiple modalities, and overlapping as a sampling strategy for training and testing models. To encourage other researchers to explore the evaluation framework, a public version is accessible to download from our research website.

Keywords: Quantitative analysis, brain MRI, tissue segmentation, fully convolutional neural networks

1. Introduction

Automatic brain Magnetic Resonance Imaging (MRI) tissue segmentation continues to be an active research topic in medical image analysis since it provides doctors with meaningful and reliable quantitative information, such as tissue volume measurements. This information is widely used to diagnose brain diseases and to evaluate progression through regular MRI analysis over time (Rovira et al., 2015; Steenwijk et al., 2016; Filippi et al., 2016). Hence, MRI and its study contribute to a better comprehension of the nature of brain problems and the effectiveness of new treatments.

Several tissue segmentation algorithms have been proposed throughout the years. Many supervised machine learning methods existed before the Convolutional Neural Network (CNN) era. A clear example of that is the pipelines that participated in the MRBrainS13 challenge (Mendrik et al., 2015). Commonly, intensity-based methods assumed each tissue could be represented by its intensity values (Cardoso et al., 2013), e.g. using Gaussian Mixture Models (GMM).

∗ Corresponding author.
Email addresses: [email protected] (Jose Bernal), [email protected] (Kaisar Kushibar), [email protected] (Mariano Cabezas), [email protected] (Sergi Valverde), [email protected] (Arnau Oliver), [email protected] (Xavier Lladó)

Since noise and intensity inhomogeneities degraded them, they were later equipped with spatial information (Clarke et al., 1995; Kapur et al., 1996; Liew & Yan, 2006; Valverde et al., 2015). Four main strategies were distinguished in the literature: (i) impose local contextual constraints using Markov Random Fields (MRF) (Zhang et al., 2001), (ii) include penalty terms accounting for neighbourhood similarity in clustering objective functions (Pham, 2001), (iii) use a Gibbs prior to model spatial characteristics of the brain (Shattuck et al., 2001) and (iv) introduce spatial information using probabilistic atlases (Ashburner & Friston, 2005; Ashburner et al., 2012). It is important to remark that some of these methods, like FAST (Zhang et al., 2001) and SPM (Ashburner & Friston, 2005; Ashburner et al., 2012), are still being used in medical centres due to their robustness and adaptability (Valverde et al., 2017).

Nowadays, CNNs have become appealing to address this task since (i) they have achieved record-shattering performances in various fields of computer vision and (ii) they discover classification-suitable representations directly from the input data, unlike conventional machine learning strategies. However, unlike traditional approaches, these methods still present two main issues when placed in real-life scenarios: (i) lack of sufficient labelled data and (ii) domain adaptation issues, also related to generalisation problems.


Seminal work on CNNs for brain tissue segmentation dates back to 2015, when Zhang et al. (2015) proposed a CNN to address infant brain tissue segmentation on MRI, where tissue distributions overlap and, hence, the GMM assumption does not hold. The authors showed that their CNN was suitable for the problem and could outperform techniques such as random forests, support vector machines, coupled level sets, and majority voting. From thereon, many more sophisticated proposals have been devised (Litjens et al., 2017; Bernal et al., 2017).

Former CNN strategies for tissue segmentation were trained to provide a single label given an input patch (Zhang et al., 2015; Moeskops et al., 2016; Chen et al., 2016). Naturally, both training and testing can be time-consuming and computationally demanding. Also, the relationship between neighbouring segmented voxels is not encoded, in principle, and, consequently, additional components on the architecture (such as in (Stollenga et al., 2015)) or post-processing may be needed to smooth results. These drawbacks can be diminished by adapting the network to perform dense prediction. The prevailing approach consists of replacing fully connected layers by 1 × 1 convolutional layers (1 × 1 × 1 if processing 3D data). This particular group is known as Fully CNN (FCNN) (Long et al., 2015).
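For illustration, a minimal Keras sketch of this dense-prediction conversion could look as follows. This is our own toy fragment, not one of the evaluated architectures; the patch size and number of kernels are arbitrary.

```python
from keras.layers import Input, Conv3D
from keras.models import Model

n_classes = 4  # background, CSF, GM and WM
inputs = Input(shape=(32, 32, 32, 1))                        # one-modality 3D patch
features = Conv3D(30, (3, 3, 3), activation='relu')(inputs)  # stand-in feature stack
# A 1 x 1 x 1 convolution takes the role of the fully connected
# classification layer, so every voxel of the feature map is labelled at once.
outputs = Conv3D(n_classes, (1, 1, 1), activation='softmax')(features)
model = Model(inputs, outputs)
```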

Regarding input dimensionality, three main streams are identified: 2D, 2.5D and 3D. At the beginning of the CNN era, most of the state-of-the-art CNN techniques were 2D, in part due to (i) their initial usage on natural images, and (ii) computational limitations of processing 3D volumes directly. Naturally, three independent 2D models can be arranged to handle patches from the axial, sagittal and coronal views at the same time, hence improving the acquired contextual information. These architectures are referred to in the literature as 2.5D (Lyksborg et al., 2015; Birenbaum & Greenspan, 2016; Kushibar et al., 2017). With advances in technology, more 3D approaches have been developed and have attracted more researchers as they tend to outperform 2D architectures (Bernal et al., 2017). Intuitively, the improvement of 3D over 2D and 2.5D lies in the fact that more information from the three orthogonal planes is integrated into the network, i.e. more contextual knowledge is acquired. However, this does not mean they always perform better (Ghafoorian et al., 2017). It is important to highlight that, to the best of our knowledge, no 2.5D FCNN network has been created yet.

In this paper, we quantitatively analyse 4 × 2 FCNN architectures for tissue segmentation on brain MRI. These networks, comprising 2D and 3D implementations, are inspired by four recent works (Cicek et al., 2016; Dolz et al., 2017; Guerrero et al., 2017; Kamnitsas et al., 2017). The models are tested on three well-known datasets of infant and adult brain scans, with different spatial resolution, voxel spacing, and image modalities. In this paper, we aim to (i) compare different FCNN strategies for tissue segmentation; (ii) quantitatively analyse the effect of the network's dimensionality (2D or 3D) for tissue segmentation and the impact of fusing information from single or multiple modalities; and (iii) investigate the effects of extracting patches with a certain degree of overlap as a sampling strategy in both training and testing. To the best of our knowledge, this is the first work providing a comprehensive evaluation of FCNNs for the task mentioned above.

The rest of the paper is organised as follows. In Section 2, we present our evaluation framework: selected datasets, assessed networks, aspects to analyse, pipeline description and implementation details. Results are reported in Section 3 and analysed in Section 4. Final remarks are discussed in Section 5.

Table 1: Relevant information from the considered datasets. The elements to be considered are presented in the first column and the corresponding information from IBSR18, MICCAI 2012 and iSeg2017 is detailed in the following ones. In the row related to the number of scans (with GT), the number of training and test volumes is separated by a + sign. For both IBSR18 and iSeg2017, the evaluation is carried out using leave-one-out cross-validation.

Item                  IBSR18       MICCAI 2012  iSeg2017
Target                Adult        Adult        Infants
Number of scans       18           15 + 20      10
Bias-field corrected  Yes          Yes          Yes
Intensity corrected   No           Yes          No
Skull stripped        No           No           Yes
Voxel spacing         Anisotropic  Isotropic    Isotropic
Modalities            T1-w         T1-w         T1-w, T2-w


2. Materials and methods

2.1. Considered datasets

Public datasets are commonly used for assessing brain MRI tissue segmentation algorithms as they provide ground truth labels. In this work, we consider one publicly available repository and two challenges: the Internet Brain Segmentation Repository 18 (IBSR18)1, the MICCAI Multi-Atlas Labeling challenge 2012 (MICCAI 2012)2 and the 6-month infant brain MRI segmentation challenge (iSeg2017)3, respectively. The datasets were chosen since (i) they have been widely used in the literature to compare different methods and (ii) they contain infant and adult data, with different voxel spacing and a varied number of scans. To use the annotations of MICCAI 2012, we mapped all the labels to form the three tissue classes. We believe that these two factors allow us to see how robust, generalised and useful in different scenarios the algorithms can be. Specific details of these datasets are presented in Table 1.
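As an illustration of this label grouping, a sketch could look as follows. The label identifiers used here are placeholders, not the actual MICCAI 2012 protocol ids.

```python
import numpy as np

# Hypothetical mapping from structural labels to tissue classes
# (0: background, 1: CSF, 2: GM, 3: WM); the ids below are placeholders.
TISSUE_OF_LABEL = {4: 1, 11: 2, 44: 3}

def to_tissue_classes(labels):
    """Collapse a structural label volume into the three tissue classes."""
    tissues = np.zeros_like(labels)
    for original_id, tissue_id in TISSUE_OF_LABEL.items():
        tissues[labels == original_id] = tissue_id
    return tissues
```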

2.2. FCNNs for brain MRI segmentation tasks

The proposed works using FCNNs for brain MRI segmentation tasks are listed in Table 2. Proposals comprise single or multiple flows of information, referred to in the literature as single-path and multi-path architectures, respectively. While single-path networks process input data faster than multi-path ones, the knowledge fusion occurring in the latter strategy leads to better segmentation results: various feature maps from different interconnected modules and shallow layers are used to produce a final verdict (Moeskops et al., 2016). Under this scheme, a network is provided with contrast, fine-grained and implicit contextual information. Furthermore, proposals apply either successive convolutions only, or convolutions and de-convolutions in the so-called u-shaped models.

1 http://www.nitrc.org/projects/ibsr
2 https://masi.vuse.vanderbilt.edu/workshop2012
3 http://iseg2017.web.unc.edu/


Table 2: Significant information of state-of-the-art FCNN approaches for brain segmentation tasks. The reference articles are listed in the first column. The following columns outline information regarding the dimensionality of the input, high-level architectural details and the segmentation problem addressed by the authors. U-shaped architectures are denoted by "[U]".

Article                  Architecture        Target
Brosch et al. (2016)     3D multi-path [U]   Lesion
Kleesiek et al. (2016)   3D single-path      Skull stripping
Nie et al. (2016)        2D single-path [U]  Tissue
Shakeri et al. (2016)    2D single-path      Sub-cortical structure
Kamnitsas et al. (2017)  3D multi-path       Lesion/tumour
Dolz et al. (2017)       3D multi-path       Sub-cortical structure
Guerrero et al. (2017)   2D multi-path [U]   Lesion
Moeskops et al. (2017)   2D single-path      Structure

The latter approach commonly considers connections from high-resolution layers to up-sampled ones to retain location and contextual information (Cicek et al., 2016; Ronneberger et al., 2015; Milletari et al., 2016).

From the papers indexed in Table 2, we built four multi-path architectures inspired by the works of Kamnitsas et al. (2017), Dolz et al. (2017), Cicek et al. (2016) and Guerrero et al. (2017), i.e. two convolution-only and two u-shaped architectures. The networks were implemented in 2D and 3D to investigate the effect of the network's dimensionality on tissue segmentation. All these architectures were implemented from scratch following the architectural details given in the original works and are publicly available at our research website4. Although slight architectural differences may be observed, the core idea of the proposals is retained. More details of the networks are given in the following sections.

2.2.1. Networks incorporating multi-resolution information

Kamnitsas et al. (2017) proposed a two-path 3D FCNN applied to brain lesion segmentation. This approach achieved top performance on two public benchmarks, BRATS 2015 and ISLES 2015. By processing information of the targeted area from two different scales simultaneously, the network incorporated local and larger contextual information, providing a more accurate response (Moeskops et al., 2016). A high-level scheme of the architecture is depicted in Fig. 1a. Initially, two independent feature extractor modules extracted maps from patches from normal and downscaled versions of an input volume. Each module was formed by eight 3 × 3 × 3 convolutional layers using between 30 and 50 kernels. Afterwards, feature maps were fused and mined by two intermediate 1 × 1 × 1 convolutional layers with 150 kernels. Finally, a classification layer (another 1 × 1 × 1 convolutional layer) produced the segmentation prediction using a softmax activation.

Dolz et al. (2017) presented a multi-resolution 3D FCNN architecture applied to sub-cortical structure segmentation. A general illustration of the architecture is shown in Fig. 1b. The network consisted of 13 convolutional layers: nine 3 × 3 × 3, and four 1 × 1 × 1. Each one of these layers was immediately followed by a Parametric Rectified Linear Unit (PReLU) layer, except for the output layer, whose activation was softmax.

4 https://github.com/NIC-VICOROB/tissue_segmentation_comparison

Multi-resolution information was integrated into this architecture by concatenating feature maps from shallower layers to the ones resulting from the last 3 × 3 × 3 convolutional layer. As explained by Hariharan et al. (2015), this kind of connection allows networks to learn semantic information, coming from deeper layers, as well as fine-grained localisation information, coming from shallow layers.

2.2.2. U-shaped networks

In the u-shaped network construction scheme, feature maps from higher resolution layers are commonly merged with the ones on deconvolved maps to keep localisation information. Merging has been addressed in the literature through concatenation (Cicek et al., 2016; Brosch et al., 2016) and addition (Guerrero et al., 2017). In this paper, we consider networks using both approaches. A general scheme of our implementations inspired by both works is displayed in Fig. 1c.
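The two merging styles can be sketched in Keras as follows. This is a toy fragment with assumed shapes, not the full architectures; note that element-wise addition requires the two maps to match in shape, while concatenation only requires matching spatial dimensions.

```python
from keras.layers import Input, add, concatenate

skip = Input(shape=(32, 32, 32, 64))       # high-resolution (skip) features
upsampled = Input(shape=(32, 32, 32, 64))  # deconvolved features
merged_unet = concatenate([skip, upsampled])  # 3D u-net style: channel concatenation
merged_uresnet = add([skip, upsampled])       # u-ResNet style: element-wise addition
```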

Cicek et al. (2016) proposed a 3D u-shaped FCNN, known as 3D u-net. The network is formed by four convolution-pooling layers and four deconvolution-convolution layers. The number of kernels ranged from 32 in its bottommost layers to 256 in its topmost ones. In this design, maps from higher resolutions were concatenated to upsampled maps. Each convolution was immediately followed by a Rectified Linear Unit (ReLU) activation function.

Guerrero et al. (2017) designed a 2D u-shaped residual architecture applied to lesion segmentation, referred to as u-ResNet. The building block of this network was the residual module, which (i) added feature maps produced by 3 × 3- and 1 × 1-kernel convolution layers, (ii) normalised the resulting features using batch normalisation, and, finally, (iii) applied a ReLU activation. The network consisted of three residual modules with 32, 64 and 128 kernels, each one followed by a 2 × 2 max pooling operation. Then, a single residual module with 256 kernels was applied. Afterwards, successive deconvolution-and-residual-module pairs were employed to enlarge the network's output size. The number of filters went from 256 to 32 in the layer before the prediction one. Maps from higher resolutions were merged with deconvolved maps through addition.

2.3. Evaluation measurement

To evaluate the proposals, we used the Dice similarity coefficient (DSC) (Dice, 1945; Crum et al., 2006). The DSC determines the extent of overlap between a given segmentation and the ground truth. Given an input volume V, its corresponding ground truth G = {g1, g2, ..., gn}, n ∈ Z, and the obtained segmentation output S = {s1, s2, ..., sm}, m ∈ Z, the DSC is mathematically expressed as

\[
\mathrm{DSC}(G, S) = \frac{2\,|G \cap S|}{|G| + |S|}, \tag{1}
\]

where | · | represents the cardinality of a set. The values for DSC lie within [0, 1], where the interval extremes correspond to null or exact similarity between the compared surfaces, respectively. Additionally, we consider the Wilcoxon signed-rank test to assess the statistical significance of differences among architectures.
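For reference, a minimal NumPy sketch of Eq. (1) for labelled volumes, with our own naming, could be:

```python
import numpy as np
from scipy.stats import wilcoxon

def dice(gt, seg, label):
    """Eq. (1) for one tissue label in two labelled volumes of equal shape."""
    g, s = (gt == label), (seg == label)
    denom = g.sum() + s.sum()
    # Both sets empty means perfect (trivial) agreement for this label.
    return 2.0 * np.logical_and(g, s).sum() / denom if denom > 0 else 1.0

# Paired per-scan DSC values of two architectures can then be compared with
# the Wilcoxon signed-rank test, e.g. wilcoxon(dsc_method_a, dsc_method_b).
```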

Figure 1: High-level diagram of the considered networks. Our implementations are inspired by the works of (a) Kamnitsas et al. (2017), (b) Dolz et al. (2017), and (c) Cicek et al. (2016) and Guerrero et al. (2017). Only 3D versions are shown. Notation is as follows: four-element tuples indicate the number of channels and the patch size in x, y and z, in that order; triples in brackets indicate kernel size. In (c), merging is either concatenation or addition; CoreEle stands for the core elements of the models (both are detailed on the bottom left and right corners of (c)); the letter K on the core elements is the number of kernels at a given stage.


2.4. Aspects to evaluate

As mentioned previously, this paper aims to analyse (i) overlapping patch extraction in training and testing, (ii) single- and multi-modality architectures and (iii) 2D and 3D strategies. Details on these three evaluation cornerstones are discussed in the following sections.

2.4.1. Overlapping sampling in training and testing

One of the drawbacks of networks performing dense inference is that, under similar conditions, the number of parameters increases. This implies that more samples should be used in training to obtain acceptable results. The most common approach consists of augmenting the input data through transformations, e.g. translation, rotation, scaling. However, if the output dimension is not equal to the input size, other options can be considered. For instance, patches can be extracted from the input volumes with a certain extent of overlap so that the same voxel is seen several times in different neighbourhoods. An example of this technique can be observed in Fig. 2. Consequently, (i) more samples are gathered, and (ii) networks are provided with information that may improve spatial consistency, as illustrated in Fig. 3 (a-d).

The aforementioned sampling strategy can be further enhanced by overlaying predicted patches.

Figure 2: Patch extraction with null, medium and high overlap. Yellow and blue areas correspond to the first and second blocks to consider. When there is overlap among patches, voxels are seen in different neighbourhoods each time.

Unlike sophisticated post-processing techniques, the network itself is used to improve its segmentation. As depicted in Fig. 3 (e-h), the leading property of this post-processing technique is that small segmentation errors, e.g. holes and block boundary artefacts, are corrected. The consensus among outputs can be addressed through majority voting, for instance.
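A sketch of this strided extraction for cubic patches, under our own naming, could be:

```python
import numpy as np

def extract_patches(volume, patch_size=32, overlap=0.5):
    """Collect overlapping cubic patches; the stride shrinks as overlap grows."""
    stride = max(1, int(patch_size * (1.0 - overlap)))
    patches, coords = [], []
    for x in range(0, volume.shape[0] - patch_size + 1, stride):
        for y in range(0, volume.shape[1] - patch_size + 1, stride):
            for z in range(0, volume.shape[2] - patch_size + 1, stride):
                patches.append(volume[x:x + patch_size,
                                      y:y + patch_size,
                                      z:z + patch_size])
                coords.append((x, y, z))
    return np.stack(patches), coords
```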

2.4.2. Input modalities

Depending on the number of modalities available in a dataset, approaches can be either single- or multi-modality. If many modalities were acquired, networks can be adapted to process them all at the same time, either using different channels or various processing paths, also referred to in the literature as early and late fusion schemes (Ghafoorian et al., 2017), respectively. Naturally, regarding computational resources the former strategy is desirable, but the latter may extract more valuable features. In this work, we consider early fusion only. Regardless of the fusion scheme, merging different sources of information may provide models with complementary features, which may lead to enhanced segmentation results (Zhang et al., 2015).
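In the early fusion scheme, co-registered modalities are simply stacked as input channels; a minimal sketch with stand-in volumes:

```python
import numpy as np

t1 = np.random.rand(144, 192, 256)   # stand-in for a co-registered T1-w volume
t2 = np.random.rand(144, 192, 256)   # stand-in for the matching T2-w volume
fused = np.stack([t1, t2], axis=-1)  # early fusion: modalities as channels
```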


Figure 3: Segmentation using overlapping patch extraction in training (a-d) and testing (e-h). From left to right: T1-w volume (a), (e); ground truth (b), (f); segmentation without overlap (c), (g); and with overlap (d), (h). The area inside the red box depicts notable changes between strategies. Notice that results obtained with overlapping sampling appear more similar to the ground truth. Colours for CSF, GM and WM are red, blue and green, respectively.


2.4.3. Network's dimensionality

There are two streams of FCNNs regarding input dimensionality: 2D and 3D. On the one hand, 2D architectures are fast, flexible and scalable; however, they completely ignore data from neighbouring slices, i.e. the implicit information is reduced compared to 3D approaches. On the other hand, 3D networks acquire valuable implicit contextual information from the orthogonal planes. These strategies tend to lead to better performance than 2D ones, even if labelling is carried out slice-by-slice, but they are computationally demanding, with an exponential increase in parameters and resource consumption. Moreover, due to the increasing number of parameters, these models may require larger training sets. Therefore, depending on the data itself, one approach may be more suitable than the other.

2.5. Implementation details

General tissue segmentation pipelines contemplate four essential components: preprocessing, data preparation, classification and post-processing. Specific implementations of each one of these elements can be plugged and unplugged as required to achieve the best performance. First, preprocessing is carried out by (i) removing the skull, and (ii) normalising intensities between scans. We use the ground truth masks to address the former task and standardise our data to have zero mean and unit variance. Second, data is prepared by extracting useful and overlapping patches containing information from one of the three tissues. Third, classification takes place; the segmentation of an input volume is produced by means of majority voting in the case of overlapping predictions, as sketched below. Fourth, no post-processing technique was considered.
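A sketch of this majority-voting reconstruction from overlapping label patches, with our own naming and coordinates as returned by a strided extractor:

```python
import numpy as np

def rebuild_by_majority_vote(label_patches, coords, vol_shape, n_classes=4):
    """Fuse overlapping predicted label patches by per-voxel majority voting."""
    votes = np.zeros(vol_shape + (n_classes,), dtype=np.int32)
    p = label_patches.shape[1]  # cubic patch side
    for patch, (x, y, z) in zip(label_patches, coords):
        for c in range(n_classes):
            votes[x:x + p, y:y + p, z:z + p, c] += (patch == c)
    # Each voxel takes the label most frequently predicted for it.
    return votes.argmax(axis=-1)
```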

All the networks were trained for a maximum of 20 epochs. The training stopping criterion was overfitting, which was monitored using an early stopping policy with patience equal to 2. For each of the datasets, we split the training set into training and validation (80% and 20% of the volumes, respectively). Additionally, voxels lying in the background area were given a weight of zero to avoid considering them in the optimisation/training process.

Table 3: Number of parameters per considered architecture and per dimensionality.

Dimensionality  DM         KK         UNet       UResNet
2D              569,138    547,053    1,930,756  994,212
3D              7,099,418  3,332,595  5,605,444  2,622,948


All the architectures have been implemented from scratch in Python, using the Keras library. From here on, our implementations of Dolz et al. (2017), Kamnitsas et al. (2017), Cicek et al. (2016), and Guerrero et al. (2017) are denoted by DM, KK, UNet and UResNet, respectively. The number of parameters per architecture is listed in Table 3. All the experiments have been run on a GNU/Linux machine running Ubuntu 16.04, with 128GB RAM. CNN training and testing have been carried out using a single TITAN-X PASCAL GPU (NVIDIA corp., United States) with 8GB RAM. The framework developed for this work is currently available to download at our research website. The source code includes the architecture implementations and experimental evaluation scripts.
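A hedged Keras sketch of the training setup described above follows. The stand-in model, the optimiser and the voxel-flattened output layout are our assumptions for illustration, not details reported here; only the epoch budget, the early-stopping patience, the validation split and the zero background weights come from the text.

```python
import numpy as np
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Input
from keras.models import Model

n_vox, n_classes = 32 ** 3, 4
inputs = Input(shape=(n_vox, n_classes))              # stand-in for FCNN features
outputs = Dense(n_classes, activation='softmax')(inputs)
model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              sample_weight_mode='temporal')          # enables per-voxel weights

x = np.random.rand(10, n_vox, n_classes)              # toy data
y = np.eye(n_classes)[np.random.randint(n_classes, size=(10, n_vox))]
w = (y.argmax(-1) > 0).astype('float32')              # zero weight on background
model.fit(x, y, sample_weight=w, validation_split=0.2, epochs=20,
          callbacks=[EarlyStopping(monitor='val_loss', patience=2)])
```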

3. Results

The evaluation conducted in this paper is three-fold. First, we investigate the effect of overlapping patches in both the training and testing stages. Second, we assess the improvement of multi-modality architectures over single-modality ones. Third, we compare the different models on the three considered datasets. Note that, for the sake of simplicity, the network's dimensionality is shown as a subscript (e.g. UResNet2D denotes the 2D version of the UResNet architecture).

3.1. Overlapping

To evaluate the effect of extracting overlapping patches in training and testing, we ran all the architectures on the three datasets contemplating three levels of overlap: null, medium and high (approximately 0%, 50% and 90%, respectively). On IBSR18 and iSeg2017, the evaluation was carried out using a leave-one-out cross-validation scheme. On MICCAI2012, the process consisted of using the given training and testing sets.

The first test consisted of quantifying the improvement between networks trained with either null or high degrees of overlap in training. The resulting p-values obtained on the three datasets are depicted in Fig. 4. In all the cases, the model trained with high overlap led to higher DSC values than the one trained without. As can be observed, in most of the cases (49 out of 72), the overlapping sampling led to statistically significantly higher performance than when omitted. The two groups (convolutional-only and u-shaped) exhibited opposite behaviours. On the one hand, the highest improvements were noted in u-shaped networks.


Figure 4: p-values obtained when comparing DSC values between null and high overlapping sampling used only in training. From top to bottom: CSF, GM and WM values for IBSR18, MICCAI2012 and iSeg2017. In all the cases, the model trained with high overlap led to higher performance than the one trained without.

This is related to the fact that no overlap may mean not enough samples. On the other hand, convolutional-only models evidenced the smallest increase. Since their output patches are smaller, more data can be extracted and passed during training; therefore, they can already provide accurate results. This fact is illustrated by the results of DM2D and KK2D.

The second test consisted of quantifying the improvement of extracting patches using combinations of the three considered degrees of overlap during training and testing. As mentioned previously, results were fused using a majority voting technique. We noted that, in general, the difference between results using null and high extents of overlap in prediction is not significant (p-values > 0.05). Also, the IQR remained the same regardless of the method or dataset. Nevertheless, the general trend was an improvement of mean DSC of at least 1% in the overlapping cases. Another important observation from our experiments is that zero impact or a slight degradation of the DSC values was noted when training with null overlap and testing with high overlap. Naturally, this last outcome is a consequence of merging the predictions of a poorly trained classifier.

A medium level of overlap in patch extraction, in both training and testing, led to improvement w.r.t. the null degree cases but yielded lower values than when using a high extent of overlap. That is to say, the general trend is: the greater the extent of overlap, the higher the overall performance of the method. The price to pay for higher levels of overlap is computational time and power, since the number of samples to process increases proportionally.


Figure 5: Evaluating the impact of single or multiple modalities for tissue segmentation on the iSeg2017 dataset. The DSC values displayed in the plot were obtained through leave-one-out cross-validation. In the plot, the same colour is used to represent each pair of single- and multi-modality versions of the same architecture. For each pair, left and right indicate whether the model considers a unique sequence or various, respectively. In the legend, subscripts indicate the dimensionality of the architecture. According to our experiments, all the multi-modality architectures significantly outperformed their single-modality analogues for GM and WM.

For example, given an input volume with dimensions 256 × 256 × 256 and a network producing an output size of 32 × 32 × 32, the numbers of possible patches to be extracted following the null, medium and high overlap policies are 512, 3375 and 185193, respectively.
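These counts follow from the per-axis formula below; the strides of 32, 16 and 4 voxels used to reproduce them are our reconstruction of the three overlap levels:

\[
n_{\text{axis}} = \left\lfloor \frac{D - p}{s} \right\rfloor + 1, \qquad n_{\text{total}} = n_{\text{axis}}^{3},
\]

so with D = 256 and p = 32, strides s = 32, 16 and 4 give 8^3 = 512, 15^3 = 3375 and 57^3 = 185193 patches.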

Since overlapping sampling proved useful, the results shown in the following sections correspond to those obtained using high overlap in both training and testing.

3.2. Single and multiple modalities

We performed leave-one-out cross-validation on the iSeg2017 dataset using the implemented 2D and 3D architectures to assess the effect of single and multiple imaging sequences on the final segmentation. The results of this experiment are shown in Fig. 5.

As can be observed, the more input modalities, the better the segmentation. In this case, two modalities not only allowed the network to achieve a higher mean but also to reduce the IQR, i.e. results are consistently better. This behaviour was evidenced regardless of architectural design or tissue type. For instance, while the best single-modality strategy scored 0.937 ± 0.011, 0.891 ± 0.010 and 0.868 ± 0.016 for CSF, GM and WM, respectively, its multi-modality analogue yielded 0.944 ± 0.008, 0.906 ± 0.008 and 0.887 ± 0.017 for the same classes. Additionally, most of the strategies using both T1-w and T2-w obtained DSC values that were statistically higher than those of their single-modality counterparts (16 out of the 8 × 3 cases returned p-values < 0.01).

3.3. Comparison of 2D and 3D FCNN architectures

The eight architectures were evaluated using their best parameters according to the previous sections on three different datasets: IBSR18, MICCAI2012 and iSeg2017. The DSC mean and standard deviation values are shown in Table 4.


Table 4: DSC values obtained by DM, KK, UResNet and UNet on the three analysed datasets. The values with an asterisk (*) indicate that a certain architecture obtained a significantly higher score (p-value < 0.01) than its analogue.

             Class  DM2D         DM3D          KK2D          KK3D          UResNet2D    UResNet3D     UNet2D       UNet3D
IBSR18       CSF    0.87 ± 0.05  0.86 ± 0.07   0.88 ± 0.04*  0.80 ± 0.20   0.90 ± 0.03  0.89 ± 0.05   0.90 ± 0.03  0.88 ± 0.05
             GM     0.96 ± 0.01  0.96 ± 0.01   0.88 ± 0.04   0.96 ± 0.01*  0.96 ± 0.01  0.96 ± 0.01   0.96 ± 0.01  0.96 ± 0.01
             WM     0.92 ± 0.02  0.93 ± 0.02*  0.92 ± 0.02   0.92 ± 0.02   0.93 ± 0.02  0.93 ± 0.02   0.93 ± 0.02  0.93 ± 0.02
MICCAI 2012  CSF    0.87 ± 0.04  0.91 ± 0.02*  0.81 ± 0.10   0.90 ± 0.03*  0.89 ± 0.04  0.91 ± 0.03*  0.90 ± 0.04  0.91 ± 0.03*
             GM     0.95 ± 0.02  0.96 ± 0.01*  0.95 ± 0.02   0.96 ± 0.01*  0.96 ± 0.02  0.96 ± 0.02*  0.96 ± 0.02  0.97 ± 0.01*
             WM     0.92 ± 0.03  0.94 ± 0.02*  0.93 ± 0.02   0.94 ± 0.02*  0.93 ± 0.03  0.94 ± 0.02*  0.94 ± 0.03  0.95 ± 0.02*
iSeg2017     CSF    0.92 ± 0.01  0.94 ± 0.01*  0.92 ± 0.01   0.94 ± 0.01*  0.91 ± 0.01  0.93 ± 0.01*  0.92 ± 0.01  0.93 ± 0.01*
             GM     0.88 ± 0.01  0.91 ± 0.01*  0.88 ± 0.01   0.90 ± 0.01*  0.87 ± 0.01  0.89 ± 0.01*  0.88 ± 0.01  0.90 ± 0.01*
             WM     0.86 ± 0.01  0.89 ± 0.02*  0.86 ± 0.02   0.88 ± 0.02*  0.85 ± 0.02  0.87 ± 0.02*  0.86 ± 0.02  0.88 ± 0.01*

The networks performing best on IBSR18, MICCAI2012 and iSeg2017 were UResNet2D and UNet2D, UNet3D, and DM3D, respectively. While for MICCAI2012 and iSeg2017 the 3D approaches took the lead, the 2D versions performed best on IBSR18. Taking into account the information in Table 1, 3D architectures appear to be slightly more affected by differences in voxel spacing than 2D ones, since the former group obtains similar or lower results than the latter, unlike in the other datasets. One of the reasons explaining this outcome could be the lack of sufficient data allowing 3D networks to understand variations in voxel spacing, i.e. 3D networks may be overfitting to one of the voxel spacing groups. Nevertheless, the increase of the 2D networks w.r.t. the 3D ones is not statistically significant overall.

Segmentation outputs obtained by the different methods on one of the volumes of the IBSR18 dataset are displayed in Fig. 6. Note that the architectures using 2D information were trained with axial slices. As can be seen in the illustration, since 2D architectures process each slice independently, the final segmentation is not necessarily accurate nor consistent: (i) subcortical structures exhibit unexpected shapes and holes, and (ii) sulci and gyri are not finely segmented. Thus, even if segmentation was carried out slice-by-slice, 3D approaches exhibit a smoother segmentation, presumably since these methods exploit the 3D volumes directly.

Another thing to note in Fig. 6f is that the segmentation provided by KK3D seems worse than the rest, even than that of its 2D analogue. The problem does not appear to be related to the number of parameters, since KK3D requires the smallest number of parameters in the 3D group, according to Table 3. This issue may be a consequence of the architectural design itself: anisotropic voxels and heterogeneous spacing may be considerably affecting the low-resolution path of the network. Hence, the overall performance is degraded.

In comparison with the state of the art, our methods showed similar or enhanced performance. First, the best DSC scores for IBSR18 were collected by Valverde et al. (2015). The highest values for CSF, GM and WM were 0.83 ± 0.08, 0.88 ± 0.04 and 0.81 ± 0.07, while our best approach scored 0.90 ± 0.03, 0.96 ± 0.01 and 0.92 ± 0.02 for the same classes. Second, the best-known values for tissue segmentation using the MICCAI 2012 dataset were reported by Moeskops et al. (2016). Their strategy, a multi-path CNN, obtained 0.85 ± 0.04 and 0.94 ± 0.01 for CSF and WM, respectively, while our best approach yielded 0.92 ± 0.03 and 0.95 ± 0.02. In this case, we cannot establish a direct comparison of GM scores since in Moeskops' case this class was subdivided into (a) cortical GM and (b) basal ganglia and thalami. Third, based on the results displayed in Table 4, our pipeline using DM3D led to the best segmentation results in the iSeg2017 leave-one-out cross-validation. Hence, we submitted our approach to the online challenge under the team name "nic vicorob"5. The mean DSC values were 0.951, 0.910 and 0.885 for CSF, GM and WM, correspondingly, and we also ranked top-5 in six of the nine evaluation scenarios (three classes, three measures).

4. Discussion

In this paper, we quantitatively analysed eight FCNN architectures inspired by the literature on brain segmentation related tasks. The networks were assessed through three experiments studying the importance of (i) overlapping patch extraction, (ii) multiple modalities, and (iii) the network's dimensionality.

Our first experiment evaluated the impact of overlapping as a sampling strategy at the training and testing stages. This overlapping sampling is explored as a workaround to the commonly used data augmentation techniques. This procedure can be applied in this case as none of these networks processes a whole volume at a time, but rather patches of it. Based on our results, the technique proved beneficial, since most of the networks obtained significantly higher values than when it was not considered. In particular, the four u-shaped architectures exhibited a remarkable influence of this approach, presumably since more samples are used during training and since the same area is seen with different neighbouring regions, enforcing spatial consistency. Overlapping sampling in testing acts as a de-noising technique. We observed that this already-incorporated tool led to better performance than when absent, as it helps to fill small holes in areas expected to be homogeneous. The improvement was found to be at least 1%. Naturally, the main drawback of this technique is its dependence on the expertise of the classifier itself, since it may produce undesired outputs when poorly trained.

5 Results can be viewed at http://iseg2017.web.unc.edu/rules/results/

Figure 6: Segmentation outputs of the eight considered methods on one IBSR18 volume: (a) original T1-w, (b) ground truth, (c) DM2D, (d) DM3D, (e) KK2D, (f) KK3D, (g) UResNet2D, (h) UResNet3D, (i) UNet2D, (j) UNet3D. The colours for CSF, GM and WM are red, blue and green, respectively. White arrows point out areas where differences w.r.t. the ground truth in (b) are more noticeable. Architectures using 2D information were trained with axial slices.

Our second experiment assessed the effect of single and multiple imaging sequences on the final segmentation. We observed that, regardless of the segmentation network, the inclusion of various modalities led to significantly better segmentations than the single-modality approach. This situation may be a consequence of the networks being able to extract valuable contrast information. Improvements were noted concerning the mean as well as the dispersion of the values yielded by the methods. Although this outcome is aligned with the literature (Zhang et al., 2015), additional trials on more datasets should be carried out to draw stronger conclusions. Additionally, future work should consider evaluating tissue segmentation in the presence of pathologies and using more imaging sequences, such as FLAIR and PD.

Our third experiment evaluated significant differences between 2D and 3D methods on the three considered datasets. In general, 3D architectures produced significantly higher performance than their 2D analogues. However, on one of our datasets, IBSR18, the results were quite similar between the two groups. Since the other two sets were re-sampled to obtain isotropic voxels, this outcome seems to be a consequence of the heterogeneity of the data in IBSR18, i.e. 2D methods seem to be more resilient to this kind of issue than 3D ones.

Regarding network design, we observed that networks using information from shallower layers in deeper ones achieved higher performance than those using features directly from the input volume. Note that the difference is intensified in the heterogeneous dataset, IBSR18, where the latter strategy performs worse on average. Although it concerns only two networks, KK2D and KK3D, this situation may underline the importance of internal connections (e.g. skip connections, residual connections) and the fusion of multi-resolution information to segment more accurately. No remarkable difference was seen between convolutional-only and u-shaped architectures, except for processing times. In both training and testing, u-shaped networks produce segmentations faster than convolutional-only networks: u-shaped models require extracting fewer patches and provide a larger output at a time. For instance, at testing time and using a high degree of overlap, UNet3D can process a volume of size 256 × 256 × 256 in around 130 seconds, while DM3D can take up to 360 seconds.

Regarding general performance, two methods, DM3D and UNet3D, displayed the best results. It is important to remark that our specific implementation of the latter architecture (i) required 30% fewer parameters than the former, and (ii) classifies ≈ 32K more voxels at a time, which makes it appealing regarding computational speed. If the priority is time (training and testing), UResNet is a suitable option since it produces the same output size as the UNet, but with approximately half the number of parameters. Therefore, patch-based u-shaped networks are recommended for addressing tissue segmentation compared to convolutional-only approaches.

Taking into account the results reported in the literature, we achieved top performance for IBSR18, MICCAI2012 and iSeg2017 with our implemented architectures.


There are three relevant things to note in this work. First, none of these networks has been explicitly tweaked to the scenarios; a typical pipeline has been used. Hence, it is possible to compare them under similar conditions. Approaches expressly tuned for challenges may win, but that does not imply they will work identically, using the same set-up, in real-life scenarios. Second, although these strategies have shown acceptable results, more development on domain adaptation and transfer learning (zero-shot or one-shot training) should be carried out before implementing them in medical centres. Third, we do not intend to compare the original works. Our implementations are inspired by the original works, but their complete pipelines are not taken into account here. In short, our study focuses on understanding the architectural strengths and weaknesses of literature-like approaches.

5. Conclusions

In this paper, we have quantitatively analysed 4 × 2 FCNN architectures, 2D and 3D, for tissue segmentation on brain MRI. These networks were implemented inspired by four recently proposed networks (Cicek et al., 2016; Dolz et al., 2017; Guerrero et al., 2017; Kamnitsas et al., 2017). Among other characteristics, these methods comprised (i) convolutional-only and u-shaped architectures, (ii) single- and multi-modality inputs, (iii) 2D and 3D network dimensionality, (iv) varied implementations of multi-path schemes, and (v) different numbers of parameters.

The eight networks were tested using three different well-known datasets: IBSR18, MICCAI2012 and iSeg2017. These datasets were considered since they were acquired with different configuration parameters. Testing scenarios evaluated the impact of (i) overlapping sampling in both training and testing, (ii) multiple modalities, and (iii) 2D and 3D inputs on tissue segmentation. First, we observed that extracting patches with a certain degree of overlap among themselves led consistently to higher performance. The same approach at testing did not show a relevant improvement (around 1% in DSC); it is a de-noising tool that comes along with the trained network. Second, we noted that using multiple modalities, when available, can help the method achieve significantly higher accuracy values. Third, based on our evaluation for tissue segmentation, 3D methods tend to outperform their 2D counterparts. However, it is relevant to recognise that 3D methods appear slightly more affected by variations in voxel spacing. Additionally, the networks implemented in this paper were able to deliver state-of-the-art results on IBSR18 and MICCAI2012. Our best approach on iSeg2017 ranked top-5 in most of the online testing scenarios.

To encourage other researchers to use the implemented evaluation framework and FCNN architectures, we have released a public version at our research website.

Acknowledgments

Jose Bernal and Kaisar Kushibar hold FI-DGR2017 grants from the Catalan Government with reference numbers 2017FI B00476 and 2017FI B00372, respectively. Mariano Cabezas holds a Juan de la Cierva - Incorporación grant from the Spanish Government with reference number IJCI-2016-29240. This work has been partially supported by La Fundació la Marató de TV3, and by Retos de Investigación TIN2014-55710-R, TIN2015-73563-JIN and DPI2017-86696-R from the Ministerio de Ciencia y Tecnología. The authors gratefully acknowledge the support of the NVIDIA Corporation with their donation of the TITAN-X PASCAL GPU used in this research. The authors would like to thank the organisers of the iSeg2017 challenge for providing the data.

References

Ashburner, J., Barnes, G., & Chen, C. (2012). SPM8 Manual. http://www.fil.ion.ucl.ac.uk/. [Online; accessed 18 May 2017].

Ashburner, J., & Friston, K. J. (2005). Unified segmentation. NeuroImage, 26, 839–851.

Bernal, J., Kushibar, K., Asfaw, D. S., Valverde, S., Oliver, A., Martí, R., & Lladó, X. (2017). Deep convolutional neural networks for brain image analysis on magnetic resonance imaging: a review. CoRR, abs/1712.03747. arXiv:1712.03747.

Birenbaum, A., & Greenspan, H. (2016). Longitudinal multiple sclerosis lesion segmentation using multi-view convolutional neural networks. In International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (pp. 58–67). Springer.

Brosch, T., Tang, L. Y., Yoo, Y., Li, D. K., Traboulsee, A., & Tam, R. (2016). Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE Transactions on Medical Imaging, 35, 1229–1239.

Cardoso, M. J., Melbourne, A., Kendall, G. S., Modat, M., Robertson, N. J., Marlow, N., & Ourselin, S. (2013). AdaPT: an adaptive preterm segmentation algorithm for neonatal brain MRI. NeuroImage, 65, 97–108.

Chen, H., Dou, Q., Yu, L., Qin, J., & Heng, P.-A. (2016). VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images. CoRR, abs/1608.05895. arXiv:1608.05895.

Cicek, O., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D u-net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 424–432). Springer.

Clarke, L., Velthuizen, R., Camacho, M., Heine, J., Vaidyanathan, M., Hall, L., Thatcher, R., & Silbiger, M. (1995). MRI segmentation: methods and applications. Magnetic Resonance Imaging, 13, 343–368.

Crum, W. R., Camara, O., & Hill, D. L. G. (2006). Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Transactions on Medical Imaging, 25, 1451–1461.

Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26, 297–302.

Dolz, J., Desrosiers, C., & Ayed, I. B. (2017). 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. NeuroImage.

Filippi, M., Rocca, M. A., Ciccarelli, O., De Stefano, N., Evangelou, N., Kappos, L., Rovira, A., Sastre-Garriga, J., Tintore, M., Frederiksen, J. L., Gasperini, C., Palace, J., Reich, D. S., Banwell, B., Montalban, X., & Barkhof, F. (2016). MRI criteria for the diagnosis of multiple sclerosis: MAGNIMS consensus guidelines. The Lancet Neurology, 15, 292–303.

Ghafoorian, M., Karssemeijer, N., Heskes, T., van Uden, I., Sanchez, C., Litjens, G., de Leeuw, F.-E., van Ginneken, B., Marchiori, E., & Platel, B. (2017). Location sensitive deep convolutional neural networks for segmentation of white matter hyperintensities. Nature Scientific Reports, 7.

Guerrero, R., Qin, C., Oktay, O., Bowles, C., Chen, L., Joules, R., Wolz, R., Valdes-Hernandez, M., Dickie, D., Wardlaw, J. et al. (2017). White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. CoRR, abs/1706.00935. arXiv:1706.00935.

Hariharan, B., Arbelaez, P., Girshick, R., & Malik, J. (2015). Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 447–456).

Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P., Kane, A. D., Menon, D. K., Rueckert, D., & Glocker, B. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis, 36, 61–78.

Kapur, T., Grimson, W. E. L., Wells, W. M., & Kikinis, R. (1996). Segmentation of brain tissue from magnetic resonance images. Medical Image Analysis, 1, 109–127.

Kleesiek, J., Urban, G., Hubert, A., Schwarz, D., Maier-Hein, K., Bendszus, M., & Biller, A. (2016). Deep MRI brain extraction: a 3D convolutional neural network for skull stripping. NeuroImage, 129, 460–469.

Kushibar, K., Valverde, S., Gonzalez-Villa, S., Bernal, J., Cabezas, M., Oliver, A., & Lladó, X. (2017). Automated sub-cortical brain structure segmentation combining spatial and deep convolutional features. CoRR, abs/1709.09075. arXiv:1709.09075.

Liew, A. W.-C., & Yan, H. (2006). Current methods in the automatic tissue segmentation of 3D magnetic resonance brain images. Current Medical Imaging Reviews, 2, 91–103.

Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A., van Ginneken, B., & Sanchez, C. I. (2017). A survey on deep learning in medical image analysis. CoRR, abs/1702.05747. arXiv:1702.05747.

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).

Lyksborg, M., Puonti, O., Agn, M., & Larsen, R. (2015). An ensemble of 2D convolutional neural networks for tumor segmentation. In Scandinavian Conference on Image Analysis (pp. 201–211). Springer.

Mendrik, A. M., Vincken, K. L., Kuijf, H. J., Breeuwer, M., Bouvy, W. H., De Bresser, J., Alansary, A., De Bruijne, M., Carass, A., El-Baz, A. et al. (2015). MRBrainS challenge: online evaluation framework for brain image segmentation in 3T MRI scans. Computational Intelligence and Neuroscience, 2015, 1.

Milletari, F., Navab, N., & Ahmadi, S.-A. (2016). V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV) (pp. 565–571).

Moeskops, P., Veta, M., Lafarge, M. W., Eppenhof, K. A., & Pluim, J. P. (2017). Adversarial training and dilated convolutions for brain MRI segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (pp. 56–64). Springer.

Moeskops, P., Viergever, M. A., Mendrik, A. M., de Vries, L. S., Benders, M. J., & Isgum, I. (2016). Automatic segmentation of MR brain images with a convolutional neural network. IEEE Transactions on Medical Imaging, 35, 1252–1261.

Nie, D., Wang, L., Gao, Y., & Shen, D. (2016). Fully convolutional networks for multi-modality isointense infant brain image segmentation. In Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium (pp. 1342–1345).

Pham, D. L. (2001). Robust fuzzy segmentation of magnetic resonance images. In Proceedings 14th IEEE Symposium on Computer-Based Medical Systems (CBMS 2001) (pp. 127–131).

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234–241). Springer.

Rovira, A., Wattjes, M. P., Tintore, M., Tur, C., Yousry, T. A., Sormani, M. P., De Stefano, N., Filippi, M., Auger, C., Rocca, M. A. et al. (2015). Evidence-based guidelines: MAGNIMS consensus guidelines on the use of MRI in multiple sclerosis - clinical implementation in the diagnostic process. Nature Reviews Neurology, 11, 471–482.

Shakeri, M., Tsogkas, S., Ferrante, E., Lippe, S., Kadoury, S., Paragios, N., & Kokkinos, I. (2016). Sub-cortical brain structure segmentation using F-CNN's. In Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium (pp. 269–272).

Shattuck, D. W., Sandor-Leahy, S. R., Schaper, K. A., Rottenberg, D. A., & Leahy, R. M. (2001). Magnetic resonance image tissue classification using a partial volume model. NeuroImage, 13, 856–876.

Steenwijk, M. D., Geurts, J. J. G., Daams, M., Tijms, B. M., Wink, A. M., Balk, L. J., Tewarie, P. K., Uitdehaag, B. M. J., Barkhof, F., Vrenken, H. et al. (2016). Cortical atrophy patterns in multiple sclerosis are non-random and clinically relevant. Brain, 139, 115–126.

Stollenga, M. F., Byeon, W., Liwicki, M., & Schmidhuber, J. (2015). Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15) (pp. 2998–3006). Cambridge, MA, USA: MIT Press.

Valverde, S., Oliver, A., Cabezas, M., Roura, E., & Lladó, X. (2015). Comparison of 10 brain tissue segmentation methods using revisited IBSR annotations. Journal of Magnetic Resonance Imaging, 41, 93–101.

Valverde, S., Oliver, A., Roura, E., Gonzalez-Villa, S., Pareto, D., Vilanova, J. C., Ramio-Torrenta, L., Rovira, A., & Lladó, X. (2017). Automated tissue segmentation of MR brain images in the presence of white matter lesions. Medical Image Analysis, 35, 446–457.

Zhang, W., Li, R., Deng, H., Wang, L., Lin, W., Ji, S., & Shen, D. (2015). Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. NeuroImage, 108, 214–224.

Zhang, Y., Brady, M., & Smith, S. (2001). Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20, 45–57.
