

NeuroImage 60 (2012) 59–70


Technical Note

Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images

Carlton Chu a,1, Ai-Ling Hsu b,1, Kun-Hsien Chou c, Peter Bandettini a, Ching-Po Lin b,c,⁎, and for the Alzheimer's Disease Neuroimaging Initiative 2

a Section on Functional Imaging Methods, Laboratory of Brain and Cognition, NIMH, NIH, Bethesda, USA
b Institute of Brain Science, National Yang Ming University, Taipei, Taiwan, ROC
c Brain Connectivity Laboratory, Institute of Neuroscience, National Yang Ming University, Taipei, Taiwan, ROC

⁎ Corresponding author at: Institute of Neuroscience, National Yang-Ming University, 155, Nong St., Sec. 2, Peitou, Taipei, Taiwan, ROC. Fax: +886 2 28262285.
E-mail address: [email protected] (C. Lin).
1 These authors contributed equally to this work.
2 Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators is available at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Authorship_List.pdf.

1053-8119/$ – see front matter © 2011 Elsevier Inc. All rights reserved.
doi:10.1016/j.neuroimage.2011.11.066

Article info

Article history:
Received 1 August 2011
Revised 15 November 2011
Accepted 21 November 2011
Available online 1 December 2011

Keywords:
Feature selection
Support vector machine (SVM)
Disease classification
Recursive feature elimination (RFE)

Abstract

There are growing numbers of studies using machine learning approaches to characterize patterns of anatomical difference discernible from neuroimaging data. The high dimensionality of image data often raises a concern that feature selection is needed to obtain optimal accuracy. Among previous studies, mostly using fixed sample sizes, some show greater predictive accuracies with feature selection, whereas others do not. In this study, we compared four common feature selection methods: 1) pre-selected regions of interest (ROIs) based on prior knowledge; 2) univariate t-test filtering; 3) recursive feature elimination (RFE); and 4) t-test filtering constrained by ROIs. The predictive accuracies achieved from different sample sizes, with and without feature selection, were compared statistically. To demonstrate the effect, we used grey matter segmented from the T1-weighted anatomical scans collected by the Alzheimer's Disease Neuroimaging Initiative (ADNI) as the input features to a linear support vector machine classifier. The objective was to characterize the patterns of difference between Alzheimer's disease (AD) patients and cognitively normal subjects, and between mild cognitive impairment (MCI) patients and normal subjects. In addition, we compared the classification accuracies between MCI patients who converted to AD and MCI patients who did not convert within a period of 12 months. Predictive accuracies from the two data-driven feature selection methods (t-test filtering and RFE) were no better than those achieved using whole-brain data. We showed that we could achieve the most accurate characterizations by using prior knowledge of where to expect neurodegeneration (hippocampus and parahippocampal gyrus). Therefore, feature selection does improve the classification accuracies, but the improvement depends on the method adopted. In general, larger sample sizes yielded higher accuracies, with less advantage obtained by using knowledge from the existing literature.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

Common questions that investigators have about analyzing image data often concern how best to represent it. For morphometric analysis, it is common for researchers to ask whether they should use voxel-based morphometry, cortical thickness mapping, or any one of a vast number of other possible representations. No one approach


is optimal in all situations. One that works well for differentiating among some populations may not work so well when used for others. The empirical way to determine which is "better" involves assessing how well the various representations are able to actually discriminate among the populations. Studies that apply predictive and classification models are based on predictive validity (Forster, 2002). If the classification accuracy is higher than chance, it implies that the input features (which could be the whole brain or a small brain region) contain some information to discriminate among the populations (Ashburner and Kloppel, 2011). There is increasing interest in applying multivariate pattern analysis (MVPA) to study the patterns of neurodegenerative diseases and mental disorders using anatomical MRI, including Alzheimer's disease (AD), Huntington's disease (HD), major depressive disorder (MDD), schizophrenia, autism spectrum disorder (ASD) (Chu, 2009; Costafreda et al., 2009; Eger et al., 2008; Fan et al., 2005; Kloppel et al., 2009; Kloppel et al., 2008), and many other disorders or diseases. Such analyses have often been applied to


Table 1
Demographics of the dataset.

                    AD            MCI           NC
Size                131           261           188
Female/male         63/68         86/174        96/92
Age (Mean±SD)       78.53±4.91    77.92±4.97    76.51±4.58
MMSE (Mean±SD)      23.24±2.08    26.98±1.81    29.16±0.96
CDR                 0.5, 1.0      0.5           0
Memory box score    1.0           0.5           0

*MMSE: Mini-Mental State Examination (scores 0–30); AD (20–26); MCI, NC (24–30).
*CDR: Clinical Dementia Rating (0.5 — very mild, 1 — mild, 2 — moderate, 3 — severe).


voxel-based morphometry (VBM) data, and studies on the classification of AD in particular have grown significantly in recent years.

The high dimensionality of the input feature space in comparison with the relatively small number of subjects (the curse of dimensionality) is a widespread concern, so some form of feature selection is often applied (Costafreda et al., 2009; Fan et al., 2007; Guyon and Elisseeff, 2003; Pelaez-Coca et al., 2010). However, many studies have used support vector machines (SVMs) and other kernel methods. These approaches search for a solution in the kernel space, so the number of optimized parameters is bounded by the number of training samples rather than by the dimension of the input features. In other words, kernel methods implicitly reduce the dimensionality of the problem to the number of samples (subjects in this case). Moreover, neuroimaging data exhibit high correlations among features, implying a much lower "effective dimensionality" than the actual number of voxels, and samples (images) are likely to lie on a lower-dimensional manifold. Nevertheless, feature selection does have benefits other than alleviating the curse of dimensionality. One practical benefit is that it speeds up the testing process. Some also claim that it can make interpretation easier: it is often simpler to explain results from those subsets of the data that are sufficient to maintain equivalent classification accuracies (Marquand et al., 2011). The main benefit claimed for feature selection, and the main focus of this manuscript, is that it increases classification accuracy. It is believed that removing non-informative signal can reduce noise and increase the contrast between labelled groups. Some researchers consider feature selection to be necessary, and some reviewers make papers difficult to publish if no feature selection has been used. However, there is little evidence to support the claim that feature selection is always necessary.

Feature selection approaches have shown many promising results in the field of bioinformatics (Saeys et al., 2007). Many feature selection methods and techniques originated in, or were specifically developed for, bioinformatics. However, the properties of input features in bioinformatics are very different from those of features in neuroimaging data, especially as there are often more independent features in bioinformatics data. Among previous studies of disease classification using imaging data, mostly using a fixed sample size, some show higher classification accuracies with feature selection. However, a recent study (Cuingnet et al., 2010) using 509 subjects showed no accuracy improvement with feature selection. We hypothesize that any improvement due to feature selection depends on sample size and on the prior knowledge that the feature selection method utilizes. A priori, we believe that when the training sample is sufficiently large, feature selection will have a tiny, or even negative, impact on performance. Based on Occam's razor (the principle of parsimony), we should prefer the simpler method when its performance is equivalent to that of more sophisticated ones. That is to say, it is perhaps preferable not to apply feature selection when it yields no improvement in classification performance.

To test our hypothesis, we compared the accuracy of discriminating AD from normal controls (NC), and of discriminating mild cognitive impairment (MCI) from NC, using four common feature selection methods and different training set sizes. The imaging data came from the Alzheimer's Disease Neuroimaging Initiative (ADNI), which has been used in many studies of AD classification (Filipovych and Davatzikos, 2011; Gerardin et al., 2009; Pelaez-Coca et al., 2011; Stonnington et al., 2010; Yang et al., 2011; Zhang et al., 2011). To obtain the maximum amount of training data of the same modality, we used a single T1-weighted image from each subject. The pre-processing pipeline was the same as is often used in voxel-based morphometry (VBM) studies, and involved using Jacobian-scaled, spatially normalized grey matter (GM) as the input features. Cross-validation and resampling techniques were applied to estimate the accuracies. The performance, with and without feature selection, was compared statistically.

2. Methods

This section describes the methods of feature selection and machine learning applied to the ADNI data, and the pre-processing of the image data.

2.1. ADNI dataset

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI), which was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of amnestic MCI and early probable AD. The determination of sensitive and specific biomarkers of early AD and disease progression is intended to aid researchers and clinicians in developing new treatments and monitoring their effectiveness, as well as to lessen the time and cost of clinical trials.

Although ADNI includes various imaging modalities, we only used the more abundant T1-weighted images obtained using MPRAGE or equivalent protocols with varying resolutions (typically 1.25×1.25 mm in-plane spatial resolution and 1.2 mm thick sagittal slices). Only images obtained using 1.5 T scanners were used in this study, and we used the first time point when multiple images of the same subject had been acquired at different times. The sagittal images were pre-processed according to a number of steps detailed on the ADNI website, which corrected for field inhomogeneities and image distortion, and were resliced to axial orientation. We visually inspected the grey matter (GM) maps of all subjects (in native space) after tissue segmentation. Images that could not be segmented correctly were removed (the list of removed subjects is in the Supplementary material). The dataset, which we used in the comparisons of feature selection, contained 260 MCI (86 females, 174 males), 188 NC (96 females, 92 males), and 131 AD (63 females, 68 males) subjects (Table 1). The list of subjects used in this manuscript can be found in the Supplementary material.

In addition, a subset of MCI patients who had at least three longitudinal scans (a baseline image and two subsequent images, six and twelve months later) was selected. The converters were defined as subjects who had MMSE scores between 20 and 26 (inclusive), and a CDR of 0.5 or 1.0, in the second or the third scan. The non-converters were defined as subjects who had MMSE scores between 24 and 30 (inclusive), and a CDR of 0.5, in all three scans. These criteria are consistent with the criteria for AD and MCI in ADNI.
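As a concrete illustration, the labelling rule above can be sketched as follows. This is our own sketch, not the authors' code; the function names are hypothetical, and we assume the MMSE and CDR criteria must hold jointly on the same visit.

```python
# Sketch of the converter/non-converter rule described above (our own
# illustration; function names are hypothetical, and we assume the MMSE
# and CDR criteria are evaluated jointly on the same scan).

def is_converter(mmse, cdr):
    """mmse, cdr: values at the three visits (baseline, 6 and 12 months)."""
    # Converter: MMSE 20-26 and CDR 0.5 or 1.0 in the 2nd or 3rd scan.
    return any(20 <= m <= 26 and c in (0.5, 1.0)
               for m, c in zip(mmse[1:], cdr[1:]))

def is_nonconverter(mmse, cdr):
    # Non-converter: MMSE 24-30 and CDR 0.5 in all three scans.
    return all(24 <= m <= 30 and c == 0.5 for m, c in zip(mmse, cdr))
```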

2.2. Image pre-processing

All individual T1-weighted images were pre-processed with SPM8 (http://www.fil.ion.ucl.ac.uk/spm) running under a Linux MATLAB 7.10 platform (The MathWorks, Natick, MA, USA). Before tissue


Table 2
C values and the corresponding feature sizes.

Number of voxels   C1      C2      C3      C4      C5      C6
Whole brain        2^-12   2^-14   2^-16   2^-18   2^-20   2^-22
82,260             2^-9    2^-11   2^-13   2^-15   2^-17   2^-19
71,229             2^-9    2^-11   2^-13   2^-15   2^-17   2^-19
25,355             2^-7    2^-9    2^-11   2^-13   2^-15   2^-17
12,754             2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
11,031             2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
9542               2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
7522               2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
6463               2^-4    2^-6    2^-8    2^-10   2^-12   2^-14
4568               2^-3    2^-5    2^-7    2^-9    2^-11   2^-13

Table 3
ROIs selected.

ROI  Full name                                            Abbreviation  Number of voxels
1    Cingulate gyrus                                      CG            12,754
2    Hippocampus                                          HC            4568
3    Parahippocampal gyrus                                PHG           6463
4    Fusiform gyrus                                       FG            9542
5    Precuneus                                            PCu           7522
6    Middle temporal gyrus                                MTG           25,355
7    Temporal lobe                                        TL            71,229
8    Hippocampus + parahippocampal gyrus                  HC+PHG        11,031
9    Temporal lobe + hippocampus + parahippocampal gyrus  TL+HC+PHG     82,260
10   Whole brain                                          All           299,477


segmentation, the images were re-oriented such that the origin of their coordinate systems was manually set close to the anterior commissure. These images were then segmented into grey matter (GM) and white matter (WM) maps in the "realigned" space using the "new segment" toolbox with default settings. "New segmentation" is an extension of the conventional "unified segmentation" (Ashburner and Friston, 2005). It provides more accurate tissue segmentation and deals with voxels outside the brain by including extra tissue probability maps (skull, scalp, etc.). The realigned GM and WM maps were also converted to 1.5 mm isotropic resolution. Next, inter-subject alignment was refined using the Dartel toolbox (Ashburner, 2007). Briefly, this involves creating averages of the GM and WM images of all subjects, which are used as an initial template. Images were then warped into closer alignment with this template, and a new template was constructed by averaging the warped images (Ashburner and Friston, 2009). This procedure was repeated for a fixed number of iterations, using all the default settings. The GM images were warped into alignment with the average-shaped template using the estimated deformations. Total GM volume was preserved by multiplying each voxel of the warped GM by the Jacobian determinant of the deformation field (a.k.a. "modulation"). No smoothing was applied, as we found no significant improvement in classification accuracy when using whole-brain GM with smoothing; this finding also confirmed a previous study (Kloppel et al., 2008). To avoid possible edge effects at tissue boundaries, we excluded voxels in the GM population template that had values less than 0.2. These masked, warped and modulated images, which have 299,477 voxels each, served as the input features for the subsequent feature selection and classification procedures.
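The final masking step can be sketched as follows; this is a minimal numpy illustration with random arrays standing in for the real template and GM volumes, not the authors' SPM code.

```python
import numpy as np

# Minimal sketch (random stand-ins, hypothetical array sizes): mask out
# template voxels below 0.2, then flatten each modulated GM map into a
# feature vector for the classifier.
rng = np.random.default_rng(7)
template = rng.random((121, 145, 121))        # population GM average (stand-in)
gm_images = rng.random((10, 121, 145, 121))   # warped, modulated GM maps (stand-in)

mask = template >= 0.2                        # exclude likely boundary voxels
X = gm_images[:, mask]                        # (n_subjects, n_voxels_in_mask)
```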

2.3. Support vector machine (SVM)

The support vector machine (SVM) is one of the more popular supervised learning algorithms applied to neuroimaging data in recent years (Craddock et al., 2009; Fan et al., 2005; Kloppel et al., 2008; Mourao-Miranda et al., 2005). The SVM demonstrates good classification performance and is computationally efficient to train with high-dimensional data. The SVM is also known as the maximum margin classifier (Cristianini and Shawe-Taylor, 2000). The framework of the SVM was motivated by statistical learning theory (Vapnik, 1995), in which the decision boundary is selected to maximize the separation between two classes rather than to minimize the empirical loss. It is a sparse method, meaning the decision boundary may be determined by a subset of the training samples, called the support vectors (SVs). Intuitively, the SVM ignores trivial training samples that are far away from the opposite class; the SVs are the samples closest to the decision boundary (for the hard-margin SVM). The equations of the SVM can be found in many sources, including Wikipedia, so we do not repeat them in this manuscript. In this study, we used both the hard-margin and soft-margin SVM implementations in LIBSVM (Chang and Lin, 2001). The difference is that the soft-margin SVM penalizes training errors: the optimization in the primal form tries to maximize the margin while minimizing the training errors. The free parameter C controls the weighting of the errors; it can also be understood as the inverse of regularization (smaller C, more regularization, and vice versa). In the dual form, the objective functions of the soft-margin and hard-margin SVM are the same, except that C is the upper bound on the Lagrange multipliers for the soft-margin SVM. In other words, if the C in the soft-margin SVM is larger than or equal to the largest Lagrange multiplier in the hard-margin SVM, both formulations yield the same solution. This is an important issue, as some studies claim to use a soft-margin SVM while their C values may be large enough to result in hard-margin SVM training. The effect of the C value also depends on the number of features and the scale of the features. When more features are used, or when the scales of the features increase, the C

value needs to decrease to achieve the same regularization. In Table 2, we show the C values used in this study and the corresponding numbers of voxels. The C values are all powers of 2, and the power of the maximum C value (C0) for each number of voxels was defined by rounding down the base-2 logarithm of the maximum Lagrange multiplier obtained from training AD vs. NC on the ROI with that number of voxels, using the whole dataset. Each subsequent C value was computed by dividing the previous one by 4 (i.e. C1 = C0/4, C2 = C1/4, etc.).
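The grid construction can be sketched as follows (our own illustration; the function name is ours). For example, a maximum Lagrange multiplier just under 2^-9 yields C1 = 2^-12 through C6 = 2^-22, matching the whole-brain row of Table 2.

```python
import numpy as np

# Sketch of the C grid described above (function name is ours): C0's
# exponent is the floor of log2 of the maximum hard-margin Lagrange
# multiplier; each subsequent C divides the previous one by 4.
def c_grid(max_lagrange_multiplier, n_values=6):
    c0 = 2.0 ** np.floor(np.log2(max_lagrange_multiplier))
    return [c0 / 4.0 ** (k + 1) for k in range(n_values)]

# e.g. c_grid(0.001) -> [2^-12, 2^-14, 2^-16, 2^-18, 2^-20, 2^-22]
```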

The SVM is a kernel algorithm, meaning the input of the algorithm is a kernel rather than the original input features. For an input matrix X = [x1, x2, …, xn]^T, where each row of X is one sample with d features, the linear kernel matrix is defined by K = XX^T. Alternatively, we can formulate the kernel as a linear combination of sub-kernels, K = K1 + K2 + … + Kd, where each sub-kernel is computed from one feature (voxel) individually. From this perspective, feature selection can be viewed as finding the optimal combination of kernels, or as a binary version of multiple kernel learning (MKL).
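This decomposition is easy to verify numerically; the sketch below (with made-up sizes) also shows that the kernel matrix is n × n regardless of the number of voxels d, which is why kernel methods are bounded by the number of samples.

```python
import numpy as np

# Verify K = X X^T equals the sum of per-voxel sub-kernels K_j
# (outer products of each feature column). Sizes are made up.
rng = np.random.default_rng(0)
n, d = 5, 1000                       # 5 "subjects", 1000 "voxels"
X = rng.standard_normal((n, d))

K = X @ X.T                          # full linear kernel, shape (n, n)
K_sum = sum(np.outer(X[:, j], X[:, j]) for j in range(d))
# Feature selection amounts to choosing which K_j enter the sum.
```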

2.4. Feature selection methods

The goal of feature selection is to select a subset of features (voxels) from the original input features. We applied four different feature selection methods. One is based on prior knowledge about regions of brain atrophy found in previous studies, and two are purely data-driven approaches. We also created a hybrid method, which combines the known parcellation of anatomical regions with the data-driven approach. To compare the feature selection methods, we selected equal numbers of features in all methods at each level of selection, except for the hybrid method. The methods are described in the following:

1. Atlas-based region of interest (ROI). The ROIs are defined by the LONI Probabilistic Brain Atlas (LPBA40) (http://www.loni.ucla.edu/Atlases/LPBA40) (Shattuck et al., 2008), which was generated from anatomic regions delineated manually on the MRI scans of 40 subjects. The delineated structures were all transformed into the


Fig. 1. Overlay of the selected ROIs from the affine-registered maximum likelihood map in LPBA40 onto the population grey matter average.


atlas space. We used the SPM5 space, which is the template space in SPM5. There are a total of 56 structures in the atlas. Most regions are left/right pairs (e.g. left cuneus, right cuneus), except for the cerebellum and brainstem. To generate the parcellation, we converted the LPBA40 atlas into an image of discrete regions, where each voxel was assigned a label corresponding to the most probable of the 56 structures. We combined each left and right structure into one, reducing the total number of ROIs to 27. Because the data were spatially normalized into the population average space (for details, see Image pre-processing), the maximum likelihood map of LPBA40 was additionally affine-registered to the population average and resampled using nearest-neighbour interpolation. Based on findings from previous studies (Bastos Leite et al., 2004; Chetelat et

[Fig. 2, left panel "Classify AD & NC": accuracy (40%–85%) vs. sample size (14–262), showing the original data and the fitted Von Bertalanffy function.]

Fig. 2. Classification accuracies vs. sample size for classifying AD & NC and classifying MCI & NC using the whole brain. We also fitted a Von Bertalanffy growth function, f(x) = a × (1 − exp(−b × x)).

al., 2002; Frisoni et al., 2002; Pennanen et al., 2005), we selected the nine ROIs listed in Table 3 (also see Fig. 1). Using knowledge from the literature, we would expect the hippocampus and parahippocampal gyrus to achieve the best performance. Note that the hippocampus defined in LPBA40 includes the amygdala.
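The parcellation step (most probable structure per voxel, then merging left/right pairs) can be sketched with random stand-in arrays. The pairing scheme below is a simplifying assumption: in LPBA40 the cerebellum and brainstem are unpaired, so a real merge table would map the 56 structures onto the paper's 27 combined ROIs.

```python
import numpy as np

# Sketch with random stand-ins: assign each voxel its most probable
# structure, then merge structures into combined ROIs via a lookup table.
rng = np.random.default_rng(1)
n_voxels, n_structs = 10000, 56
prob = rng.random((n_structs, n_voxels))      # atlas probabilities (stand-in)
labels = np.argmax(prob, axis=0)              # most probable structure per voxel

# Hypothetical merge table: adjacent indices treated as left/right pairs.
merge = np.arange(n_structs) // 2
roi_labels = merge[labels]

roi_mask = roi_labels == 3                    # boolean mask of one merged ROI
```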

2. Univariate t-test filtering (t-test). This is one of the most commonly used approaches. It is equivalent to performing a simple mass-univariate analysis (statistical parametric map) of the data, and thresholding at some value to obtain a mask. A two-sample t-test at each voxel of the training set was performed. Because training should not be informed by the labels of the test set in any way, the test set was excluded from this mass-univariate analysis. We then determined various thresholds for the absolute t-value,

[Fig. 2, right panel "Classify MCI & NC": accuracy vs. sample size (20–356), showing the original data and the fitted Von Bertalanffy function.]



based on the number of voxels we intended to select at each level (shown in Table 3). In other words, a smaller set (higher threshold) was always a subset of a larger set (lower threshold). It can also be seen as a combination of kernels computed from voxels in different ranges of absolute t-values, K = K_th1 + K_th2 + … + K_thk. As the threshold is lowered, more kernels are added.
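A sketch of this filter, with random stand-in data: the t-tests use only the training samples, and keeping the top-k voxels by |t| produces the nested subsets described above.

```python
import numpy as np
from scipy import stats

# Sketch with random stand-ins: two-sample t-test per voxel on the
# TRAINING set only, then keep the k voxels with the largest |t|.
rng = np.random.default_rng(2)
X_train = rng.standard_normal((40, 5000))     # (subjects, voxels), stand-in
y_train = np.repeat([0, 1], 20)

t, _ = stats.ttest_ind(X_train[y_train == 0], X_train[y_train == 1], axis=0)

def top_k_mask(t_values, k):
    mask = np.zeros(t_values.shape, dtype=bool)
    mask[np.argsort(-np.abs(t_values))[:k]] = True
    return mask

mask_small, mask_large = top_k_mask(t, 500), top_k_mask(t, 2000)
# Nesting: the 500-voxel set is a subset of the 2000-voxel set.
```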

3. Averaged absolute t-value in ROIs (t-test+ROI). This method combines the standard univariate t-test filtering with pre-defined ROIs from the atlas. We first performed mass-univariate t-tests on the training samples, from which we computed the averaged absolute t-value in the 27 ROIs defined by LPBA40 (combining left and right structures). The ROIs were ranked according to their averaged absolute t-value. We selected the top 1, 2, 3, 4, and 5 ROIs, giving 5 levels of inclusion; i.e. top 1 means using the 1st-ranked ROI, top 2 means the combination of the 1st- and 2nd-ranked ROIs, top 3 means the combination of the 1st- to 3rd-ranked ROIs, etc. This was the only method that did not have a fixed number of voxels at each level of selection, because different ROIs have different numbers of voxels. For this method, we only varied the number of combined ROIs.
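The ROI ranking can be sketched as follows, with random stand-ins for the t-map and the parcellation:

```python
import numpy as np

# Sketch with random stand-ins: rank the 27 ROIs by the mean |t| of
# their voxels, then keep the union of the top-k ROIs (k = 1..5).
rng = np.random.default_rng(3)
t = rng.standard_normal(5000)                 # per-voxel t-values (stand-in)
roi_labels = rng.integers(0, 27, size=5000)   # ROI index per voxel (stand-in)

mean_abs_t = np.array([np.abs(t[roi_labels == r]).mean() for r in range(27)])
ranked = np.argsort(-mean_abs_t)              # best ROI first

def mask_top_rois(k):
    return np.isin(roi_labels, ranked[:k])

masks = [mask_top_rois(k) for k in range(1, 6)]
```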

4. Recursive feature elimination (RFE). Recursive feature elimination is an iterative feature selection algorithm developed specifically for the SVM (Guyon et al., 2002). It is also a popular feature selection method for fMRI data (Craddock et al., 2009; De Martino et al., 2008; Marquand et al., 2010). The original algorithm removes, at each iteration, the one feature whose removal would potentially yield the largest margin increase. For a linear kernel, this means the feature

[Fig. 3, panels "AD & NC: t-test" and "MCI & NC: t-test": accuracy vs. sample size curves for the whole brain ("All") and for t-test-selected feature sets of 120,000 down to 4568 voxels.]

Fig. 3. Classification accuracies obtained from data-driven feature selection. The dark line is the accuracy obtained using the whole brain (no feature selection). Neither t-test nor RFE achieved significantly better classification than the whole brain. The table of p-values can be found in the Supplementary material. For t-tests, the accuracies were calculated from averaging 200 resamples, and for RFE, the accuracies were calculated from averaging 50 resamples. To be comparable, we also averaged different numbers of resamples for the whole brain. That is why the lines of whole brain classification (black lines) are different between (a) and (b), and are different between (c) and (d).

with the lowest absolute feature weight. An approximation is usually made for high-dimensional data, whereby a small set of features is removed at each iteration. In our implementation, we removed 3000 voxels (around 1%) at each iteration, sometimes adjusting this number to obtain the same number of features as used by the other methods (i.e. the numbers of voxels in Table 3).
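The elimination loop can be sketched as follows. This is an illustrative reimplementation, not the authors' code: a ridge-regularized least-squares fit stands in for the hard-margin SVM (whose weight vector would normally come from an SVM solver such as LIBSVM), and `rfe_linear` and its arguments are our own names.

```python
import numpy as np

def rfe_linear(X, y, n_keep, step_frac=0.01, lam=1e-3):
    """Recursive feature elimination on a linear decision function.

    At each iteration, fit a linear classifier to the surviving features
    and drop the fraction `step_frac` of them with the smallest absolute
    weights, as in SVM-RFE (Guyon et al., 2002). A ridge least-squares
    fit stands in for the SVM here.
    """
    y = np.where(np.asarray(y) > 0, 1.0, -1.0)
    active = np.arange(X.shape[1])
    while active.size > n_keep:
        Xa = X[:, active]
        # ridge weights: (Xa'Xa + lam*I)^{-1} Xa' y
        w = np.linalg.solve(Xa.T @ Xa + lam * np.eye(active.size), Xa.T @ y)
        # drop ~step_frac of the remaining features, but never overshoot n_keep
        n_drop = min(max(1, int(step_frac * active.size)), active.size - n_keep)
        active = active[np.argsort(np.abs(w))[n_drop:]]  # keep largest-|w| features
    return np.sort(active)
```

In the study itself, roughly 3000 voxels (about 1% of the whole-brain feature set) were removed per iteration, with the step occasionally adjusted so the surviving feature counts matched those of the other methods.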

2.5. Cross-validation and resampling

To estimate the classification accuracies with different training sizes, we first selected a subset of subjects from the whole dataset. To reduce the gender confound, we balanced the number of males and females in both control and patient groups, such that both groups had equal proportions of males and females. To reduce the age confound, we further matched the ages in each gender group (e.g. matched the ages of male controls and male patients, etc.). We not only matched the mean age, we also tried to match the distribution in both populations. To achieve this, we first randomly selected a subset from the group with fewer samples (e.g. there were fewer samples in female NC than in female MCI). We then estimated the distribution of age by smoothing the histogram (bin size 1 year) with a Gaussian kernel of 3 year FWHM. The new samples were then drawn from the group with more samples using a variation of rejection sampling (Bishop, 2006).
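The matching step can be sketched as follows. This is an illustrative variant, not the authors' exact implementation: the function name, the grid, and the acceptance rule are our own choices, but the idea is the same — smooth both 1-year age histograms with a Gaussian kernel and accept candidates with probability proportional to the target/source density ratio.

```python
import numpy as np

def draw_age_matched(ages_source, ages_target, n_draw, fwhm=3.0, seed=0):
    """Draw n_draw subjects from `ages_source` so that their age
    distribution approximates that of `ages_target` (rejection-sampling
    variant; illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> sigma
    lo = int(np.floor(min(ages_source.min(), ages_target.min()))) - 5
    hi = int(np.ceil(max(ages_source.max(), ages_target.max()))) + 5
    grid = np.arange(lo, hi + 1)

    def density(ages):
        # smoothed 1-year histogram, normalized to sum to one
        d = np.exp(-0.5 * ((grid[:, None] - ages[None, :]) / sigma) ** 2).sum(axis=1)
        return d / d.sum()

    p_t, p_s = density(ages_target), density(ages_source)
    # acceptance ratio, zeroed where the source density is negligible
    ratio = np.where(p_s > 1e-3 * p_s.max(), p_t / np.maximum(p_s, 1e-12), 0.0)
    ratio = ratio / ratio.max()

    chosen = []
    for i in rng.permutation(len(ages_source)):
        if len(chosen) == n_draw:
            break
        bin_i = int(round(float(ages_source[i]))) - lo  # this subject's age bin
        if rng.random() < ratio[bin_i]:
            chosen.append(int(i))
    return np.array(chosen)
```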

After the subset was selected, we applied leave-one-out cross-validation (loocv) to the subset for the hard-margin SVM. In each cross-validation trial, one sample was left out, and we applied feature

Fig. 3. Classification accuracies obtained from data-driven feature selection. The dark line is the accuracy obtained using the whole brain (no feature selection). Neither t-test nor RFE achieved significantly better classification than the whole brain. The table of p-values can be found in the Supplementary material. For t-tests, the accuracies were calculated from averaging 200 resamples, and for RFE, the accuracies were calculated from averaging 50 resamples. To be comparable, we also averaged different numbers of resamples for the whole brain. That is why the lines of whole brain classification (black lines) are different between (a) and (b), and are different between (c) and (d).


Fig. 4. Graph on the left shows classification accuracies obtained using different ROIs for classifying AD and NC with different sample sizes (resample 200 times). The dark line is the accuracy obtained by using the whole brain (no feature selection). Table on the right shows results from paired t-tests of classification accuracies between each ROI and whole brain (no feature selection). 1 indicates the ROI is significantly better than whole brain, 0 indicates not significantly better, and −1 indicates whole brain is significantly better than that ROI. The rows are different sample sizes and the columns are different ROIs. The tables of the original p-values are in the Supplementary material.

64 C. Chu et al. / NeuroImage 60 (2012) 59–70

selection to the remaining samples. After feature selection, we trained a linear hard-margin SVM, using it to predict the label of the testing sample. The procedure was repeated until each sample had been left out and tested once, and the loocv accuracy was calculated. To reduce

the variance of the estimated accuracy, we resampled the subsets multiple times for each subset size. We resampled 200 times for ROI, t-test, and t-test+ROI. For RFE, we only resampled 50 times, due to the longer computation times required. We also varied the

Fig. 5. Graph on the left shows classification accuracies obtained using different ROIs for classifying MCI and NC with different sample sizes (resample 200 times). The dark line is the accuracy obtained using the whole brain (no feature selection). Table on the right shows results from paired t-tests of classification accuracies between each ROI and whole brain (no feature selection). 1 indicates the ROI is significantly better than whole brain, 0 indicates not significantly better, and −1 indicates whole brain is significantly better than that ROI. The rows are different sample sizes and the columns are different ROIs. The tables of the original p-values are in the Supplementary material.
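The nested structure of this cross-validation — feature selection refit on the training samples of every fold, so the left-out sample never influences the selected features — can be sketched as follows. This is illustrative only: a univariate t-test filter and a nearest-class-mean classifier stand in for the various filtering methods and the hard-margin SVM used in the study, and both function names are ours.

```python
import numpy as np

def select_by_ttest(X, y, k):
    """Rank voxels by absolute two-sample t statistic (training data only)
    and return the indices of the k largest."""
    a, b = X[y == 1], X[y == -1]
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(0, ddof=1) + (nb - 1) * b.var(0, ddof=1))
                     / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    t = np.abs(a.mean(0) - b.mean(0)) / np.maximum(pooled, 1e-12)
    return np.argsort(t)[-k:]

def loocv_accuracy(X, y, k):
    """Leave-one-out CV with feature selection nested inside each fold."""
    n = len(y)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i
        feats = select_by_ttest(X[train], y[train], k)   # fit on training fold only
        mu_pos = X[train & (y == 1)][:, feats].mean(0)
        mu_neg = X[train & (y == -1)][:, feats].mean(0)
        xi = X[i, feats]
        pred = 1 if np.sum((xi - mu_pos) ** 2) < np.sum((xi - mu_neg) ** 2) else -1
        correct += int(pred == y[i])
    return correct / n
```

Running the filter once on all samples before cross-validation would instead leak information from the test sample into the selection and inflate the estimated accuracy.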


Fig. 6. Classification accuracies and the results of paired t-tests of the hybrid method (t-test+ROI) for classifying AD & NC. The dark line is the accuracy obtained using whole brain (no feature selection). ROI~1 means the top ROI, and ROI~2 means the combination of the top two ROIs, etc. The rows in the table are for different sample sizes, and the columns are different numbers of top ranked ROIs combined.


size of the subset. For classification between AD and NC, the minimum size of the subset was 14 (seven in each group: three female and four male), and the maximum size of the subset was 262. The increment between successive subset sizes was eight subjects (two male AD, two female AD, two male NC, and two female NC). However, different feature selection methods had different computational costs. For methods that required longer computation, we omitted some population sizes. For discrimination between MCI and NC, the subsets ranged from 20 to 356 subjects, with increments of eight. Again, the increments varied according to how computationally intensive the feature selection approaches were.

We have also compared the difference between loocv and 10-fold cross-validation. When we averaged more than 50 resamples, the results were not significantly different between 10-fold cross-validation and loocv, especially when the sample size was large. Details are in the Supplementary material. Because 10-fold cross-validation and loocv have minimal difference with large

Fig. 7. Classification accuracies and the results of paired t-tests for the hybrid method (t-test+ROI) for classifying MCI & NC. The dark line is the accuracy obtained using whole brain (no feature selection). ROI~1 means the top ROI, and ROI~2 means the combination of the top ranking two ROIs, etc. The rows in the table are for different sample sizes, and the columns are different numbers of top ranked ROIs combined.

resamples, we applied 10-fold cross-validation to the soft-margin SVM to accelerate the computation.

2.6. Statistical analysis

One of the objectives of this study was to examine whether feature selection could improve the classification accuracies. Because we used the same resampled sets for each sample size across the different feature selection methods, we could compare the loocv accuracies, with and without the use of feature selection, using paired t-tests.
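Because each resampled subset yields one accuracy per method, the comparison reduces to an ordinary paired t statistic over the per-resample accuracy differences (df = n − 1). A minimal sketch (the function name is ours; in practice `scipy.stats.ttest_rel` computes the statistic and p-value together):

```python
import numpy as np

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic for two methods' accuracies measured on the
    same resampled subsets; compare against Student's t with n-1 df."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
```

Pairing is what makes the test valid here: both methods see exactly the same subjects in each resample, so the subset-to-subset variation cancels in the differences.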

3. Results

Regardless of the feature selection method, larger sample sizes yielded better performance. For classifying AD and NC, whole brain without feature selection achieved an average accuracy of 84.3% at




the maximum training size of 262, with no sign of reaching a plateau. We used the Von Bertalanffy growth function, f(x) = a × (1 − exp(−b × x)), to fit the data points of accuracy vs. sample size and found that the actual accuracies were higher with larger sample sizes than the fitted curve (see Fig. 2).

For classifying MCI and NC, whole brain without feature selection achieved an average accuracy of 67.3% at the maximum training size of 356. We also fitted the accuracy curve with the Von Bertalanffy function and found, as for classifying AD and NC, that the actual accuracies were higher than the fitted curve at large training sizes (see Fig. 2).
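The growth-curve fit can be reproduced with a simple one-dimensional scan: for a fixed b, the amplitude a enters the model linearly and has a closed-form least-squares solution. A minimal sketch (the grid range and resolution are our choices, not the authors'):

```python
import numpy as np

def fit_von_bertalanffy(x, y, b_grid=None):
    """Fit f(x) = a * (1 - exp(-b * x)) by scanning candidate values of b
    and solving for a in closed form (ordinary least squares) at each."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if b_grid is None:
        b_grid = np.linspace(1e-4, 0.1, 1000)
    best = None
    for b in b_grid:
        g = 1.0 - np.exp(-b * x)      # basis function for this candidate b
        a = (g @ y) / (g @ g)         # closed-form least-squares amplitude
        sse = np.sum((y - a * g) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, b)
    return best[1], best[2]           # (a, b)
```

With accuracy expressed as a fraction, the fitted a is the asymptotic accuracy and b controls how quickly the curve approaches it; accuracies sitting above the fitted curve at large n are what the text describes.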

None of the data-driven feature selection methods (t-test filtering and RFE) achieved significantly superior performance to whole brain (no feature selection) for classifying MCI and NC or for classifying AD and NC (see Fig. 3). On the other hand, the atlas based ROI method showed significantly (p<0.001) better accuracies than whole brain (no feature selection) using the parahippocampus (PHG) ROI or the parahippocampus+hippocampus (PHG+HC) ROIs, for classifying MCI and NC and for classifying AD and NC. On average, the use of PHG+HC was better than PHG alone (see Fig. 4). However, for classifying AD and NC, HC was also significantly better than whole brain when the training size was less than or equal to 166. The results were similar for classifying MCI and NC: when the training size was less than or equal to 76, HC was significantly better (see Fig. 5). Interestingly, temporal lobe+HC+PHG resulted in the best accuracies (p<0.001) when the training size was larger than 300, and the improvement over no feature selection increased even further as the training size increased. Although other regions yielded accuracies higher than chance, their

accuracies were either the same as or worse than when using whole brain.

For classifying AD and NC, combining t-test filtering and atlas based ROIs resulted in significantly better accuracies for all five levels of inclusion when the sample sizes were small (see Fig. 6). When the sample size grew larger, smaller numbers of ROIs achieved better accuracies. When the sample size was over 94, using only the top ranked ROI achieved better accuracies, and including more ROIs reduced the accuracies. The most frequently top ranked ROI was the hippocampus. The results were the opposite for classifying MCI and NC. The classification accuracies were only significantly better when the sample size was over 260 and five levels of inclusion were employed (see Fig. 7). The tables of the original p-values are in the Supplementary material. Versions of these figures with error bars may also be found in the Supplementary material.

We also used soft-margin SVM with different levels of regularization (i.e. different C) with feature selection. In general, lowering the C (i.e. increasing regularization) did not improve the accuracies for most ROIs, except those that already yielded better results than whole brain, e.g. HC, PHG, and HC+PHG (see Fig. 8). The improvement in accuracies was most prominent in MCI vs. NC with larger sample sizes. HC became significantly better than whole brain using the optimal C value. For t-test and RFE, the best accuracies with the optimal C were still not significantly better than whole brain using hard-margin SVM (see Fig. 9).

Fig. 8. Classification accuracies obtained using different ROIs for classifying AD & NC and MCI & NC with different sample sizes (resample 200 times) by soft-margin SVM. The dark line is the accuracy obtained by using the whole brain (no feature selection). C* indicates a large C value that is identical to a hard-margin SVM. The exact C value for the corresponding feature size can be found in Table 2.

The ROIs that yielded better classification accuracy than whole brain for classifying MCI converters vs. MCI non-converters were similar to the results of AD vs. NC, except that FG was significantly better


Fig. 9. Classification accuracies obtained using data-driven feature selection (RFE and t-test) for classifying AD & NC and MCI & NC with different sample sizes (resample 200 times for t-test, resample 50 times for RFE) by soft-margin SVM. The dark line is the accuracy obtained by using the whole brain (no feature selection). C* indicates a large C value that is identical to a hard-margin SVM. The exact C value for the corresponding feature size can be found in Table 2. Different lines show different feature selections; the prefix R indicates RFE, the prefix t indicates t-test, and the number shows the number of features selected.


than whole brain when the sample sizes were small (see Fig. 10). Overall, the classification accuracies were very low.

In addition, we compared the features selected by the two data-driven techniques (t-test and RFE). The features selected using the two methods were very different. For illustrative purposes, we generated the frequency maps of features selected across the resamples from classifying AD and NC with 11,031 features for both methods (see Fig. 11). The correlation between the frequency maps is corr(t-test, RFE) = 0.31. T-tests selected the features consistently in each resampling trial, and most features were in the HC, PHG, and TL. RFE resulted in a more distributed selection, with the selected features scattered across the whole brain. However, features in HC were also consistently selected.
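Given boolean selection masks per resample, the frequency maps and their correlation reduce to a few lines (sketch; the mask layout and function names are our assumptions):

```python
import numpy as np

def selection_frequency(masks):
    """Fraction of resamples in which each voxel was selected.
    `masks` is an (n_resamples, n_voxels) boolean array."""
    return np.asarray(masks, dtype=float).mean(axis=0)

def frequency_correlation(freq_a, freq_b):
    """Pearson correlation between two methods' frequency maps."""
    return np.corrcoef(freq_a, freq_b)[0, 1]
```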

4. Discussion

Generally speaking, the classification accuracies were higher with larger sample sizes than the estimates from fitting a Von Bertalanffy function. This implies that the improvement of classification accuracy due to increasing samples plateaus more slowly than the exponential growth function. If the accuracies keep following the projection of the linear fit, prediction accuracies of 90% for separating AD and NC may be achieved using around 500 training samples. However, the main focus of this study was comparing the performance with and without the use of feature selection. From the results, we found that appropriate methods of feature selection improved the classification accuracies regardless of the sample size. However, data-driven methods without prior knowledge either did not improve the accuracies or made them worse. We were surprised to see such results, and speculate that this was because there was no additional information in the training samples that those feature selection methods could extract and the SVM could not, especially when features were highly correlated. In our dataset, there may be hardly any non-informative voxels, only less-informative voxels. The information encoded in the multivariate patterns also increased the difficulty of ranking features. Nevertheless, data-driven approaches could still remove the less-informative features and retain the most-informative voxels without any prior knowledge. This may be useful for people who rely on the feature-weight map for interpretation, because feature selection reduces the chance of interpreting unreliable voxels in the weight-map.

In general, as the training size increased, the difference between the accuracies with and without feature selection was reduced. This implies that no method of feature selection would improve the accuracies when the training size is sufficient. In the case of classifying AD and NC, because of the higher contrast between the two classes, the required number of samples is small. On the other hand, the contrast between MCI and NC is weaker and the patterns are less homogeneous (i.e. larger effective dimensionality), so more samples are required before whole brain can achieve the same accuracies as specific ROIs. In this study, even 356 samples were not sufficient.

It seemed that making use of prior knowledge resulted in the best accuracies. However, this method depends on the reliability of the information used. For example, although functional studies


Converter & Non-Converter, ROI vs. whole brain (paired t-tests, p<0.001):

SS / ROI   CG   HC   PHG   FG   PCu   MTG   TL   HC+PHG   TL+HC+PHG
200        -1    0     1   -1    -1    -1   -1        1           0
180        -1    1     1   -1    -1    -1   -1        1           0
160        -1    1     1    0    -1    -1   -1        1           0
140        -1    1     1    0    -1    -1   -1        1           0
120        -1    1     1    0    -1    -1   -1        1           0
100        -1    1     1    0    -1    -1   -1        1           0
 80         0    1     1    1    -1    -1    0        1           1
 60         0    1     1    1     0     0    0        1           1
 40         0    1     1    1     0     0    0        1           0

Fig. 10. Graph on the top shows classification accuracies obtained using different ROIs for classifying MCI converters and non-converters with different sample sizes (resample 200 times). The dark line is the accuracy obtained using the whole brain (no feature selection). Table on the bottom shows results from paired t-tests of classification accuracies between each ROI and whole brain (no feature selection). 1 indicates the ROI is significantly better than whole brain, 0 indicates not significantly better, and −1 indicates whole brain is significantly better than that ROI. The rows are different sample sizes and the columns are different ROIs.

Fig. 11. The frequency maps of features selected across the resamples for classifying AD and NC with 11,031 features for both t-test and RFE.


(Ikonomovic et al., 2011) show a strong difference between patterns of brain activity in AD and NC in the precuneus, the use of the precuneus ROI (GM density) resulted in very poor classification accuracies. Also, the best known marker of brain atrophy in AD is HC, but our results showed that combining HC and PHG resulted in higher accuracies than either of them alone. This implies that covariance between information encoded in the ROIs may help classification. This effect makes selecting the optimal combination of ROIs difficult, and we can only claim that the selected ROIs are sub-optimal. Perhaps the best method of feature selection is cross-validation with pre-defined ROIs. From our results, we found that the ranks of the classification accuracies of each ROI were quite consistent across different sample sizes, especially for classifying AD and NC. Although we only used a small validation set, we should still have a high chance of identifying the best ROI through cross-validation. Alternatively, we can also use the hybrid method (t-test+ROI), which uses the ROI as a spatial constraint and the t-test for ranking features. The hybrid method did show significant, but small, improvements of classification accuracy in NC vs. AD and NC vs. MCI (Figs. 6 and 7). However, the selected ROIs in each resample were not always consistent, although the hippocampus ROI was often ranked top.

The choice of hard-margin or soft-margin SVM did not affect our conclusion, as methods of feature selection and ROIs that did not yield better solutions still did not perform better after optimizing the soft-margin SVM. In practice, people often use three-way-split cross-validation to estimate the test accuracies, determining the C with cross-validation in the inner loops. This procedure would greatly increase the computation. We took a pragmatic approach by presenting the accuracies for all the C values we tested. The best accuracy obtained from one of the C values should be an optimistic measure, i.e. higher than the testing accuracy of three-way-split CV. Because these optimistic accuracies, which make our comparison more conservative, did not change our conclusions, we did not use the time-consuming three-way-split CV.




In this study, we applied feature selection as a procedure separate from the classifier. There are also classifiers with a built-in capability of feature selection during training. These include lasso regression (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), automatic relevance determination (Chu et al., 2010; MacKay, 1995; Tipping, 2001), and other sparse models (Carroll et al., 2009; Grosenick et al., 2008; Ryali et al., 2010; Yamashita et al., 2008). However, those methods would require a great deal of memory to select features from 300,000 voxels, which is often impractical. In practice, efficient methods of feature selection are often applied to eliminate the majority of less-informative features before applying the classifiers that can select features during training.

In conclusion, our results confirmed the previous study (Cuingnet et al., 2010) in that feature selection did not improve the performance when there was no prior knowledge. That is to say, if the right methods and the right prior knowledge are used, feature selection does improve classification performance. Although we demonstrated the effect using anatomical MRI, our experience in the Pittsburgh brain activity interpretation competition (PBAIC 2007) (Chu et al., 2011) also led to the same conclusion. Teams using data-driven feature selection to select fMRI features did not perform as well in the competition as teams that applied prior knowledge in feature selection.

Acknowledgments

This research was supported in part by the Intramural Research Program of the NIH, NIMH, the National Science Council of Taiwan (NSC 100-2628-E-010-002-MY3, NSC 100-2627-B-010-004-) and the National Health Research Institute (NHRI-EX100-9813EC). JA is funded by the Wellcome Trust. This study utilized the high performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov).

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and through generous contributions from the following: Abbott, AstraZeneca AB, Bayer Schering Pharma AG, Bristol-Myers Squibb, Eisai Global Clinical Development, Elan Corporation, Genentech, GE Healthcare, GlaxoSmithKline, Innogenetics, Johnson and Johnson, Eli Lilly and Co., Medpace, Inc., Merck and Co., Inc., Novartis AG, Pfizer Inc., F. Hoffman-La Roche, Schering-Plough, Synarc, Inc., as well as non-profit partners the Alzheimer's Association and the Alzheimer's Drug Discovery Foundation, with participation from the U.S. Food and Drug Administration. Private sector contributions to ADNI are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles.

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.neuroimage.2011.11.066.

References

Ashburner, J., 2007. A fast diffeomorphic image registration algorithm. NeuroImage 38,95–113.

Ashburner, J., Friston, K.J., 2005. Unified segmentation. NeuroImage 26, 839–851.Ashburner, J., Friston, K.J., 2009. Computing average shaped tissue probability tem-

plates. NeuroImage 45, 333–341.

Ashburner, J., Kloppel, S., 2011. Multivariate models of inter-subject anatomical vari-ability. NeuroImage 56, 422–439.

Bastos Leite, A.J., Scheltens, P., Barkhof, F., 2004. Pathological aging of the brain: anoverview. Top. Magn. Reson. Imaging 15, 369–389.

Bishop, C.B., 2006. Pattern recognition and machine learning. Springer.Carroll, M.K., Cecchi, G.A., Rish, I., Garg, R., Rao, A.R., 2009. Prediction and interpretation

of distributed neural activity with sparse models. NeuroImage 44, 112–122.Chang, C.C., Lin, C.J., 2001. LIBSVM: a library for support vector machines.Chetelat, G., Desgranges, B., De La Sayette, V., Viader, F., Eustache, F., Baron, J.C., 2002.

Mapping gray matter loss with voxel-based morphometry in mild cognitive im-pairment. Neuroreport 13, 1939–1943.

Chu, C., Bandettini, P., Ashburner, J., Marquand, A., Kloeppel, S., 2010. Classification of neuro-degenerative diseases using Gaussian process classification with automatic feature de-termination. ICPR 2010 First Workshop on Brain Decoding. IEEE, pp. 17–20.

Chu, C., Ni, Y., Tan, G., Saunders, C.J., Ashburner, J., 2011. Kernel regression for fMRI pat-tern prediction. NeuroImage 56, 662–673.

Chu, C.Y.C., 2009. PhD Thesis: Pattern recognition and machine learning for magneticresonance images with kernel methods. Wellcome Trust centre for Neuroimaging.University College London, London.

Costafreda, S.G., Chu, C., Ashburner, J., Fu, C.H., 2009. Prognostic and diagnostic poten-tial of the structural neuroanatomy of depression. PLoS One 4, e6353.

Craddock, R.C., Holtzheimer III, P.E., Hu, X.P., Mayberg, H.S., 2009. Disease state predic-tion from resting state functional connectivity. Magn. Reson. Med. 62, 1619–1628.

Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines andOther Kernel-based Learning Methods. Cambridge University Press.

Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehericy, S., Habert, M.O., Chupin, M.,Benali, H., Colliot, O., Initiative., A.s.D.N., 2002. Automatic classification of patientswith Alzheimer's disease from structural MRI: a comparison of ten methods usingthe ADNI database. NeuroImage 56 (2), 776–781.

De Martino, F., Valente, G., Staeren, N., Ashburner, J., Goebel, R., Formisano, E., 2008.Combining multivariate voxel selection and support vector machines for mappingand classification of fMRI spatial patterns. NeuroImage 43, 44–58.

Eger, E., Ashburner, J., Haynes, J.D., Dolan, R.J., Rees, G., 2008. fMRI activity patterns inhuman LOC carry information about object exemplars within category. J. Cogn.Neurosci. 20, 356–370.

Fan, Y., Rao, H., Hurt, H., Giannetta, J., Korczykowski, M., Shera, D., Avants, B.B., Gee, J.C.,Wang, J., Shen, D., 2007. Multivariate examination of brain abnormality using bothstructural and functional MRI. NeuroImage 36, 1189–1199.

Fan, Y., Shen, D., Davatzikos, C., 2005. Classification of structural images via high-dimensional image warping, robust feature extraction, and SVM. Med ImageComput Comput Assist Interv Int Conf Med Image Comput Comput Assist Interv,8, pp. 1–8.

Filipovych, R., Davatzikos, C., 2011. Semi-supervised pattern classification of medical images: application to mild cognitive impairment (MCI). NeuroImage 55, 1109–1119.

Forster, M.R., 2002. Predictive accuracy as an achievable goal of science. Philos. Sci. 69, S124–S134.

Frisoni, G.B., Testa, C., Zorzan, A., Sabattoli, F., Beltramello, A., Soininen, H., Laakso, M.P., 2002. Detection of grey matter loss in mild Alzheimer's disease with voxel based morphometry. J. Neurol. Neurosurg. Psychiatry 73, 657–664.

Gerardin, E., Chetelat, G., Chupin, M., Cuingnet, R., Desgranges, B., Kim, H.S., Niethammer, M., Dubois, B., Lehericy, S., Garnero, L., Eustache, F., Colliot, O., 2009. Multidimensional classification of hippocampal shape features discriminates Alzheimer's disease and mild cognitive impairment from normal aging. NeuroImage 47, 1476–1486.

Grosenick, L., Greer, S., Knutson, B., 2008. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Trans. Neural Syst. Rehabil. Eng. 16, 539–548.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422.

Ikonomovic, M.D., Klunk, W.E., Abrahamson, E.E., Wuu, J., Mathis, C.A., Scheff, S.W., Mufson, E.J., Dekosky, S.T., 2011. Precuneus amyloid burden is associated with reduced cholinergic activity in Alzheimer disease. Neurology 77, 39–47.

Kloppel, S., Chu, C., Tan, G.C., Draganski, B., Johnson, H., Paulsen, J.S., Kienzle, W., Tabrizi, S.J., Ashburner, J., Frackowiak, R.S., 2009. Automatic detection of preclinical neurodegeneration: presymptomatic Huntington disease. Neurology 72, 426–431.

Kloppel, S., Stonnington, C.M., Chu, C., Draganski, B., Scahill, R.I., Rohrer, J.D., Fox, N.C., Jack Jr., C.R., Ashburner, J., Frackowiak, R.S., 2008. Automatic classification of MR scans in Alzheimer's disease. Brain 131, 681–689.

MacKay, D.J.C., 1995. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6, 469–505.

Marquand, A.F., De Simoni, S., O'Daly, O.G., Mehta, M.A., Mourao-Miranda, J., 2010. Quantifying the information content of brain voxels using target information. ICPR First Workshop on Brain Decoding. IEEE, Istanbul, pp. 13–16.

Marquand, A.F., De Simoni, S., O'Daly, O.G., Williams, S.C., Mourao-Miranda, J., Mehta, M.A., 2011. Pattern classification of working memory networks reveals differential effects of methylphenidate, atomoxetine, and placebo in healthy volunteers. Neuropsychopharmacology 36, 1237–1247.

Mourao-Miranda, J., Bokde, A.L., Born, C., Hampel, H., Stetter, M., 2005. Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. NeuroImage 28, 980–995.

70 C. Chu et al. / NeuroImage 60 (2012) 59–70

Pelaez-Coca, M., Bossa, M., Olmos, S., 2011. Discrimination of AD and normal subjects from MRI: anatomical versus statistical regions. Neurosci. Lett. 487, 113–117.

Pennanen, C., Testa, C., Laakso, M.P., Hallikainen, M., Helkala, E.L., Hanninen, T., Kivipelto, M., Kononen, M., Nissinen, A., Tervo, S., Vanhanen, M., Vanninen, R., Frisoni, G.B., Soininen, H., 2005. A voxel based morphometry study on mild cognitive impairment. J. Neurol. Neurosurg. Psychiatry 76, 11–14.

Ryali, S., Supekar, K., Abrams, D.A., Menon, V., 2010. Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage 51, 752–764.

Saeys, Y., Inza, I., Larranaga, P., 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507.

Shattuck, D.W., Mirza, M., Adisetiyo, V., Hojatkashani, C., Salamon, G., Narr, K.L., Poldrack, R.A., Bilder, R.M., Toga, A.W., 2008. Construction of a 3D probabilistic atlas of human cortical structures. NeuroImage 39, 1064–1080.

Stonnington, C.M., Chu, C., Kloppel, S., Jack Jr., C.R., Ashburner, J., Frackowiak, R.S., 2010. Predicting clinical scores from magnetic resonance scans in Alzheimer's disease. NeuroImage 51, 1405–1413.

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58, 267–288.

Tipping, M.E., 2001. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244.

Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.

Yamashita, O., Sato, M., Yoshioka, T., Tong, F., Kamitani, Y., 2008. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. NeuroImage 42, 1414–1429.

Yang, W., Lui, R.L., Gao, J.H., Chan, T.F., Yau, S.T., Sperling, R.A., Huang, X., 2011. Independent component analysis-based classification of Alzheimer's disease MRI data. J. Alzheimers Dis.

Zhang, D., Wang, Y., Zhou, L., Yuan, H., Shen, D., 2011. Multimodal classification of Alzheimer's disease and mild cognitive impairment. NeuroImage 55, 856–867.

Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Statistical Methodology) 67, 301–320.