
2013 21st Iranian Conference on Electrical Engineering (ICEE), Mashhad, Iran, May 14-16, 2013



Ensemble-based Classifiers for Cancer Classification Using Human Tumor Microarray Data

Argin Margoosian, Student Member, IEEE, and Jamshid Abouei, Member, IEEE, Dept. of Electrical and Computer Engineering, Yazd University, Yazd, Iran,

Emails: [email protected], [email protected]

Abstract— In this paper, two cancer classification techniques based on multicategory microarray data sets are presented. Due to the high dimensionality of microarray data sets, choosing reliable feature selection and classification algorithms with a high degree of accuracy and a low complexity is a crucial task in bioinformatics. Toward this goal, this paper aims to maximize the cancer classification accuracy using two reliable ensemble-based classifiers, namely the ensemble of naive Bayes and the ensemble of k-nearest neighbor. Simulation results show that our classifiers have considerably better accuracy than some conventional classification techniques, such as the Support Vector Machine (SVM) and artificial neural networks, in the field of multicategory microarray cancer classification based on the fourteen cancer data set. However, the run time of the introduced ensemble-based classifiers is longer when the schemes use all features. To reduce the time complexity while preserving the same classification accuracy, we use recursive feature elimination based on the multiple support vector machine classifier to select the more informative genes before applying the ensemble-based classifiers. Numerical evaluations show at least a 30% improvement in the classification accuracy of our schemes when compared to the SVM one-versus-one rule. In addition, our schemes are much more robust to feature elimination and display a high accuracy for a low number of features.

Index Terms— Cancer classification, ensemble-based methods, microarray data set, naive Bayes classifier, recursive feature elimination.

I. INTRODUCTION

In recent years, the cancer classification task in bioinformatics has drawn the attention of researchers toward developing reliable classifiers with a high degree of accuracy and a low computational complexity [1], [2]. Microarray experiments have been recognized as revolutionary improvements in the field of cancer diagnosis, providing measurements of an extremely large number of genes simultaneously [3]. However, the main challenge in using microarray data sets is their high dimensionality. Most studies in this area are based on databases with a small number of classes, e.g., data sets with two or three classes (e.g., [1], [4], [5]). Of interest are ensemble-based classifiers, which are recognized as powerful algorithms with numerous advantages over single classifiers in many different applications [6]–[8], in particular for complicated classification problems. Ensemble-based classifiers are investigated for the binary classification task in [1], [4], [9]. Ghorai et al. [1] present an ensemble Non-parallel

The work was financially supported by the research grant T-500-5857 from the SyberSpace Research Institute (Iran Telecommunications Research Center).

Plane Proximal Classifier (NPPC) for the binary cancer classification, where a genetic algorithm is utilized for the feature and model selection to train a number of NPPCs. The authors in [4] use an AdaBoost method to generate an ensemble classifier for the discrimination of cancer tissues from normal ones. Reference [9] presents a rule-based ensemble classifier based on the colon cancer and leukemia two-class cancer data sets. In addition, multicategory cancer diagnosis, such as the classification of the fourteen cancer data set, is considered more often in feature selection studies than in classification algorithms [10]–[13]. The authors in [10] investigate two Multiclass Support Vector Machine (MSVM) techniques as a combination of binary classifiers using the one-versus-one and the one-versus-all rules for the classification of a fourteen cancer data set. Reference [11] develops a subsequent artificial neural network for multicategory microarray analysis, where the results are compared with conventional Artificial Neural Networks (ANN) for a fourteen cancer data set. Zheng et al. [12] use the extreme learning machine technique in single-hidden-layer feedforward neural networks for cancer diagnosis. The results show a better accuracy than the conventional ANN and SVM methods. The authors in [13] present a fuzzy classifier based on the memetic algorithm for cancer classification, namely Cancer Diagnosis with Memetic Fuzzy System (CD-MFS). This cancer diagnosis system works based on fuzzy "if-then" rules, where the number of these rules is minimized to reduce the computational complexity. Although the accuracy of the CD-MFS method is much better than that of some conventional classification methods such as the SVM approach, the results are not satisfactory for the cancer classification task.

There exist a few research works on ensemble classifiers for multicategory classification, most of which report low performance for a large number of classes [14], [15]. Reference [14] investigates an ensemble of Classification and Regression Trees (CART) for the eight-class NCI 60 cancer data set, where this classifier achieves about fifty percent accuracy. Dettling et al. [15] present an ensemble classifier by converting the multicategory classification into several binary ones using the one-versus-one and one-versus-all rules and then utilizing LogitBoost to generate the base classifiers. The results of this classifier for the NCI 60 data set, which are based on cross-validation, are better than those of the scheme in [14], which utilizes a separate subset of samples in the test phase; however, the results in [15] are still not satisfactory in terms of classification accuracy. Most conventional cancer classification algorithms are unable to handle the microarray

978-1-4673-5634-3/13/$31.00 ©2013 IEEE


data sets due to the curse of dimensionality, which is a significant challenge in microarray data analysis. In fact, the high dimensionality imposes a longer classification run time. Gene selection algorithms have emerged as efficient approaches to mitigate the curse of dimensionality [16]–[19]. In most cancer diagnosis processes, schemes using multicategory microarray data sets investigate the extraction of informative genes [16]–[18] rather than the classification algorithms [10]–[13]. The main goal of using gene selection algorithms is to reduce the computational complexity and the run time by selecting the more informative genes, which makes microarray data sets easier to handle [20]. The Recursive Feature Elimination (RFE) algorithm is an efficient method for reducing the number of features, yielding a lower run time and computational complexity compared to the case where the scheme uses all features for the classification task. In Recursive Feature Elimination based on Multiple Support Vector Machines (MSVM-RFE), the SVM classification algorithm is run repeatedly, scoring the genes in each step and eliminating the least-scored genes until the pre-specified number of genes is reached [18].
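The elimination loop described above can be sketched as follows. This is a simplified, single-classifier sketch rather than the authors' MSVM-RFE: a least-squares linear fit stands in for the linear SVM, and features are ranked by the squared weights w_i^2, which is the standard RFE criterion. The toy data is our own.

```python
import numpy as np

def rfe_rank(X, y, n_keep, step=1):
    """Recursive feature elimination sketch.

    Each round fits a linear model on the surviving features,
    scores feature i by w_i**2, and drops the lowest-scoring
    features until n_keep remain. A least-squares fit stands in
    here for the linear SVM used by MSVM-RFE; only the scoring
    model differs, the elimination loop is the same.
    """
    surviving = np.arange(X.shape[1])
    while surviving.size > n_keep:
        w, *_ = np.linalg.lstsq(X[:, surviving], y, rcond=None)
        scores = w ** 2
        n_drop = min(step, surviving.size - n_keep)
        drop = np.argsort(scores)[:n_drop]   # positions of least-scored features
        surviving = np.delete(surviving, drop)
    return surviving

# Toy example: feature 0 carries the label, features 1-4 are noise.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=40)
X = rng.normal(size=(40, 5))
X[:, 0] = y + 0.1 * rng.normal(size=40)
kept = rfe_rank(X, y, n_keep=1)   # the informative feature 0 should survive
```

In the paper's setting, X would be the 16063-gene expression matrix and the loop would stop at the pre-specified gene count (about 340).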

The low accuracy of the cancer classification algorithms in [10]–[15] motivates us to investigate two reliable ensemble-based classifiers with a higher accuracy for multicategory cancer classification, based on the AdaBoost procedure using the naive Bayes and the k-nearest neighbor as base classifiers. In this paper, we use these ensemble-based cancer classification algorithms, supported by simulation results, to achieve a much better classification accuracy than the results in [10]–[13]. This performance improvement is mainly due to the fact that we design a set of weak classifiers using different subsets of samples. We improve the performance by connecting those weak classifiers sequentially so that samples misclassified in previous steps are classified correctly. Simulation results show that the accuracy of the above ensemble-based classifiers is better, with a lower possibility of overfitting, than single-classifier methods such as the ANNs and the SVM-based algorithms in [10]–[13] in the field of multicategory microarray cancer classification with the fourteen cancer data set. In addition, the schemes benefit from the ability to handle the whole data set, similar to the SVM in [10]; however, the penalty is a longer run time. For this reason, the MSVM-RFE algorithm is utilized to select the more informative genes before the classification task, considerably reducing the run time through a lower computational complexity while preserving the same classification accuracy as before. Numerical evaluations show that our schemes display at least a 30% improvement in classification accuracy when compared to the SVM one-versus-one rule. In addition, our schemes are much more robust to feature elimination and display a high accuracy for a low number of features.

The rest of the paper is organized as follows. In Section II, some primary concepts and assumptions are described. The introduced ensemble-based cancer classifiers are described in Section III. In Section IV, simulation results for the fourteen cancer data set are presented. Finally, an overview of the results and conclusions is presented in Section V.

Fig. 1. Main steps in the microarray data extraction process.

II. PRIMARY CONCEPTS AND ASSUMPTIONS

In this section, we briefly introduce some primary concepts used to develop our cancer classification algorithms, based on the microarray data sets and the two well-known naive Bayes and k-nearest neighbor classifiers.

Microarray Data Set and Gene Selection: Microarray data sets have been recognized as strong tools in cancer diagnosis, providing measurements of an extremely large number of genes simultaneously; however, the number of samples is much smaller than the number of features in most DNA microarray data sets. A microarray is typically a glass slide on which DNA molecules are placed in an ordered manner at specific locations called spots. The extraction of microarray data sets follows a standard procedure, depicted in Fig. 1, which is summarized in four main steps [3]: i) all RNA is extracted from the cells; ii) the extracted RNA molecules are labeled with different fluorescent dyes; iii) the labeled samples are hybridized onto the same glass slide; iv) the spots in the hybridized microarray are excited by a laser and are scanned at suitable wavelengths. Different scanners with different capabilities have been developed for microarray extraction, such as the Affymetrix scanners.

As previously mentioned, the main challenge in DNA microarrays is the low sample-to-feature ratio, which causes the curse of dimensionality in the data analysis process using machine learning methods. Many gene/feature selection methods have been proposed in the literature to select the genes that best discriminate the samples of different classes [16]–[20]. By applying gene selection before the classification block, we can reduce the dimensionality of the data set, resulting in a much lower complexity (or, equivalently, a lower run time) and possibly a higher accuracy, depending on the utilized classifier. Since classical feature selection methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) do not work properly for cancer microarray data sets, the attention of researchers has been drawn toward finding new methods best suited for these data sets. The Recursive Feature Elimination (RFE) scheme is an efficient method in the area of cancer classification [18], even though the authors in [2] introduce a new gene selection approach that works better than the RFE scheme. However, the scheme in [2] is investigated only for binary classification, which is not the case in this paper.

Naive Bayes Classifier: The Bayes decision rule, combined with the assumption of conditionally independent features, yields the naive Bayes classifier, in which the number of classifier parameters is reduced dramatically from 2(2^n − 1) for the Bayes classifier to 2n for the naive Bayes [19]. Assuming the features X_1, ..., X_N are independent, the joint probability density function of X_1, ..., X_N conditioned on the class Y can be written as P(X_1, ..., X_N | Y) = ∏_{i=1}^{N} P(X_i | Y), and the decision rule in the naive Bayes classifier becomes

    Y ← argmax_{y_k}  P(Y = y_k) ∏_{i=1}^{N} P(X_i | Y = y_k).    (1)
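As a concrete instance of rule (1), the sketch below implements a naive Bayes classifier in Python (in practice, the log of (1) is maximized for numerical stability). The Gaussian class-conditional model, the variance floor of 1e-9, and the toy data are our own assumptions; the paper does not specify the form of P(X_i | Y).

```python
import numpy as np

def nb_fit(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    prior = np.array([np.mean(y == c) for c in classes])
    mean = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])  # variance floor
    return classes, prior, mean, var

def nb_predict(model, X):
    """argmax_k log P(Y=y_k) + sum_i log P(X_i | Y=y_k), as in rule (1)."""
    classes, prior, mean, var = model
    # log of the Gaussian density, summed over the assumed-independent features
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None]
                      + (X[:, None, :] - mean[None]) ** 2 / var[None]).sum(-1)
    return classes[np.argmax(np.log(prior)[None] + log_lik, axis=1)]

# Toy one-feature, two-class example
X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([0, 0, 1, 1])
model = nb_fit(X, y)
pred = nb_predict(model, np.array([[0.1], [1.1]]))  # nearest class wins
```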

Despite its simplicity, the naive Bayes classifier performs well when only a few training samples per class are available, which is the case in our DNA microarray data set. For highly dependent features, however, its performance is severely degraded.

k-Nearest Neighbor (KNN) Classifier: The k-nearest neighbor is one of the simplest classification algorithms: a new sample in the test set is assigned to the class most common amongst its k nearest training samples [19]. For k = 1, a test sample is assigned to the class of its nearest neighbor. Since the KNN scheme with k = 1 only needs the single nearest neighbor, its computational complexity is lower than that of the KNN algorithm for larger values of k. Choosing the best integer value of k, in terms of accuracy, complexity, and the training data set, is one challenging issue in this method; it can be selected by various heuristic techniques such as cross-validation. Various distance metrics are used in the KNN classifier to find the nearest neighbors. To reduce the complexity of the base classifiers, in this paper we use the nearest neighbor classifier with the common Euclidean distance metric. It is worth mentioning that for a large number of training samples, the KNN classifier is guaranteed to yield an error rate no worse than twice the Bayes error rate.
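The nearest-neighbor rule with the Euclidean metric used here can be sketched in a few lines; the toy data is our own illustration.

```python
import numpy as np

def nn1_predict(X_train, y_train, X_test):
    """1-nearest-neighbor classifier: each test sample takes the label
    of its closest training sample under the Euclidean metric."""
    # pairwise squared Euclidean distances, shape (n_test, n_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[np.argmin(d2, axis=1)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array(['A', 'B'])
pred = nn1_predict(X_train, y_train, np.array([[0.1, -0.1], [0.9, 1.2]]))
```

Squared distances suffice for the argmin, avoiding an unnecessary square root per pair.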

III. ENSEMBLE-BASED CLASSIFIERS IN CANCER DIAGNOSIS SYSTEMS

In this work, we consider ensemble-based classifiers in cancer classification systems based on multicategory microarray cancer data sets to achieve a high degree of accuracy. We use the AdaBoost method to design a powerful classifier using the naive Bayes and the k-nearest neighbor classifiers as base (or weak) ones connected in a sequential manner [21]. Each base classifier is trained on a subset of training samples selected based on the weight distribution of the samples and the resampling technique. The sample weight distribution is updated after designing each base classifier by increasing the weights of incorrectly classified samples, which in turn increases the chance of the misclassified samples being selected in the next step. These base classifiers are combined using the weighted majority voting rule, resulting in an ensemble-based classifier with much better performance than a single classifier, whose training error converges to zero as the number of base classifiers increases. In general, a classifier with zero training error results in overfitting on the training samples; however, this is not the case with ensemble-based methods.

We consider the two well-known naive Bayes and k-nearest neighbor base classifiers in the AdaBoost procedure to obtain two ensemble-based classifiers, following the procedure in Fig. 2. In the ensemble of naive Bayes classifiers, the prior

Fig. 2. Block diagram of the investigated ensemble-based classifier.

probabilities of the classes are chosen proportional to the number of samples in each class of the training set. To reduce the computational complexity of the base classifiers, and to avoid the process of choosing the best value of k in the ensemble of KNN algorithm, we select k = 1, since in ensemble-based methods an error rate lower than random guessing for the base classifiers is enough to reach a good performance.

The investigated methods demonstrate much better performance than the single classifiers introduced for the fourteen cancer data set; however, the run time is slightly increased compared with well-known single classifiers such as the SVM scheme. This penalty is compensated by preprocessing the data using the MSVM-RFE gene selection technique, removing the redundant features in the data set before the classification block presented above. The whole process of feature selection and classification for multicategory microarray cancer data sets is summarized in Table I. It is worth mentioning that the investigated ensemble-based classifiers do not pose any parameter selection challenge, except for the number of base classifiers, which can easily be chosen to balance the computational complexity (or run time) against the accuracy. On the other hand, the naive Bayes classifier can be implemented in a recursive manner to improve the classifier after receiving each test sample.

IV. NUMERICAL RESULTS

To evaluate the effectiveness of the investigated ensemble-based cancer classification methods in terms of classification accuracy, we present some numerical simulations based on the benchmark fourteen cancer data set in [10]. According to Table II, this data set contains 144 training and 54 test samples, with 16063 genes measured for each sample. Unlike most benchmark cancer data sets, such as the ALL-AML, colon, prostate, lymphoma, and ovarian cancer data sets [1], [4], [5], the challenging properties of the fourteen cancer data set are the large number of classes and features. To reduce the number


TABLE I
INVESTIGATED ENSEMBLE-BASED CANCER CLASSIFICATION ALGORITHM FOR THE 14 CANCER DATA SET

Input:
+ Labeled training samples x_i, i = 1, ..., N, with labels y_i ∈ {ω_1, ..., ω_C};
+ Weak classification algorithm:
  - k-Nearest Neighbor (KNN) classifier, or
  - Naive Bayes (NB) classifier;
+ Number of base classifiers, denoted by T.
Output: Ensemble decision.
Initialize:
+ Assign the uniform weight distribution D_1(i) = 1/N, i = 1, ..., N, to all the samples.
+ Select the more informative features out of 16063 using the MSVM-RFE: about 50 features for the ENB scheme and about 340 features for the ENN classifier.
Algorithm:
+ For k = 1 : T (step k: determine the k-th base classifier):
  - Select a subset of training samples using the resampling technique, based on D_k(i).
  - Train the base classifier h_k using the selected training subset.
  - Calculate the error of the k-th base classifier: ε_k = Σ_{i=1}^{N} D_k(i) · I[h_k(x_i) ≠ y_i].
  - Set β_k = ε_k / (1 − ε_k) and update the distribution, decreasing the relative weight of correctly classified samples:
        D_{k+1}(i) = (D_k(i) / Z_k) × { β_k  if h_k(x_i) = y_i;  1  otherwise },
    where Z_k is a normalization constant chosen so that D_{k+1} sums to one.
+ Generate the ensemble-based classifier by combining the base classifiers using weighted majority voting to classify a given test sample x:
  - Calculate the total vote for each class: V_j = Σ_{k : h_k(x) = ω_j} log(1/β_k), j = 1, ..., C;
  - Assign the sample x to the class ω_j with the highest total vote V_j.
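The steps in Table I can be sketched as follows. This is an illustrative AdaBoost.M1-style sketch, not the authors' implementation: a one-dimensional decision stump stands in for the NB/1-NN base learners, the toy data is ours, the example is restricted to two classes for brevity (Table I covers C classes), and ε_k is clipped away from 0 and 0.5 so that β_k stays well defined.

```python
import numpy as np

def train_stump(X, y, classes):
    """Weak learner: best single-feature threshold split. A stump
    stands in here for the paper's NB / 1-NN base classifiers."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for lo, hi in [(classes[0], classes[1]), (classes[1], classes[0])]:
                pred = np.where(X[:, j] <= t, lo, hi)
                err = np.mean(pred != y)   # resample already reflects the weights
                if best is None or err < best[0]:
                    best = (err, j, t, lo, hi)
    _, j, t, lo, hi = best
    return lambda Z: np.where(Z[:, j] <= t, lo, hi)

def adaboost(X, y, T, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    N = len(y)
    D = np.full(N, 1.0 / N)                  # uniform initial weights D_1(i) = 1/N
    hs, betas = [], []
    for _ in range(T):
        idx = rng.choice(N, size=N, p=D)     # resample training subset by weight
        h = train_stump(X[idx], y[idx], classes)
        miss = h(X) != y
        eps = np.clip(D @ miss, 1e-10, 0.5 - 1e-10)   # weighted error eps_k
        beta = eps / (1.0 - eps)
        D = D * np.where(miss, 1.0, beta)    # shrink correctly classified samples
        D = D / D.sum()                      # normalize (the Z_k step)
        hs.append(h)
        betas.append(beta)

    def predict(Z):
        votes = np.zeros((len(Z), len(classes)))
        for h, b in zip(hs, betas):          # weighted majority vote, log(1/beta_k)
            p = h(Z)
            for c_i, c in enumerate(classes):
                votes[p == c, c_i] += np.log(1.0 / b)
        return classes[np.argmax(votes, axis=1)]
    return predict

# Toy linearly separable problem
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
predict = adaboost(X, y, T=5)
pred = predict(X)
```

Base classifiers with small ε_k receive large voting weight log(1/β_k), which is why the ensemble's training error drops quickly as T grows, as observed in Section IV.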

TABLE II
BENCHMARK FOURTEEN CANCER DATA SET [10].

Cancer type            | Training samples | Test samples
Breast                 | 8                | 4
Prostate               | 8                | 6
Lung                   | 8                | 4
Colorectal             | 8                | 4
Lymphoma               | 16               | 6
Bladder                | 8                | 3
Melanoma               | 8                | 2
Uterus                 | 8                | 2
Leukemia               | 24               | 6
Renal                  | 8                | 3
Pancreas               | 8                | 3
Ovarian                | 8                | 4
Mesothelioma           | 8                | 3
Central Nervous System | 16               | 4

of features, we first apply the MSVM-RFE to the intended data set, selecting about 340 of the more informative genes. All simulations have been run in the MATLAB v7.11 environment on a PC with an Intel Core i7 2.66 GHz CPU and 4 GB of RAM.

We first evaluate the cancer diagnosis accuracy of the ensemble-based k-nearest neighbor (ENN) and the ensemble-based naive Bayes (ENB) classifiers for different numbers of base classifiers in Fig. 3. All the results in this paper are obtained by averaging over twenty iterations for each simulation. In our simulation setup, the integer k is set to one, a reasonable choice given the few training samples in each class, and one that also reduces the complexity of the base classifiers. The classification accuracy, or equivalently the error rate defined as "1 − accuracy", is presented for both the training and the test samples in Fig. 3, where we can see that the training error converges to zero using only a few (about five) base classifiers. The results indicate that the best number of base classifiers lies between ten and thirty, depending on how the classification accuracy and the run time (complexity) are weighed.

According to Fig. 4, the ensemble-based naive Bayes classifier has significantly higher accuracy than the ensemble-based k-nearest neighbor (ENN) algorithm. This result comes from the fact that the Bayes decision rule is an optimal method with the least possible error; moreover, the independence assumption on the selected features is almost satisfied, as the features are uncorrelated to a good approximation. Fig. 4 illustrates the classification accuracy of the investigated ENB and ENN classifiers versus the number of base classifiers, compared with the Ensemble of Classification and Regression Trees (ECART) algorithm used for the NCI 60 data set in [14]. It is observed that both the ENN and ENB classifiers have considerably better classification accuracy than the ECART method in [14] for the fourteen cancer microarray data set. One of the main disadvantages of the ECART is its sensitivity to the training points, meaning that the variation of even a single training point can lead to radically different decisions. Our simulation results clearly show that increasing the number of base classifiers yields a better classification accuracy; however, as a trade-off, the computational complexity and the run time increase. To satisfy both the classification accuracy and the computational complexity considerations, the number of base classifiers can be set between twenty-five and thirty.

To complete our simulation results, the investigated ensemble-based classifiers are compared with some existing classification algorithms for the benchmark fourteen cancer


[Figure: two panels, (a) and (b); x-axis: Number of Base Classifiers (0–50); y-axis: Cancer Classification Accuracy; curves: Test Error and Training Error.]

Fig. 3. Classification accuracy of (a) the ensemble-based nearest neighbor, and (b) the ensemble-based naive Bayes.

data set proposed in [10]–[13]. The schemes under simulation for the comparison are summarized as follows:

i) SVM-OVO and SVM-OVA [10]: The methods introduced by Ramaswamy et al. for the fourteen cancer data set are the well-known binary SVM method along with the one-versus-one (OVO) and one-versus-all (OVA) combination methods. For these algorithms, the multicategory classification problem is reduced to multiple binary classification tasks. Denoting by C the number of classes, the SVM-OVO classifier discriminates the samples using C(C − 1)/2 hyperplanes, each separating two classes of the fourteen cancer data set, while the SVM-OVA classifier uses C hyperplanes, each discriminating the members of one class from all others.

ii) CD-MFS-1 and CD-MFS-2 [13]: Cancer Diagnosis with Memetic Fuzzy System (CD-MFS) is a fuzzy classifier based on the memetic algorithm for cancer classification. This

[Figure: x-axis: Number of Base Classifiers (0–50); y-axis: Cancer Classification Accuracy (0.2–1); curves: ENN Classifier, ENB Classifier, ECART Classifier.]

Fig. 4. Classification accuracy of the ensemble-based k-nearest neighbor (ENN) and the ensemble-based naive Bayes (ENB) classifiers versus the number of base classifiers, compared to the ECART scheme in [14].

TABLE III
CLASSIFICATION ACCURACY ON THE TEST SET WITH DIFFERENT METHODS FOR THE FOURTEEN CANCER DATA SET.

Classification Method | Classification Accuracy (%)
ENB                   | 94.00
ENN                   | 85.00
ELM [12]              | 78.00
SANN [11]             | 75.00
SVM-OVA [10]          | 72.00
CD-MFS-2 [13]         | 69.43
CD-MFS-1 [13]         | 55.78
SVM-OVO [10]          | 57.40

classifier works based on fuzzy if-then rules, where the number of candidate rules is very large (6^d, with d the data dimension); however, only the best rule is preserved to reduce the complexity.

iii) ELM [12]: Cancer classification using an Extreme Learning Machine (ELM) is based on a single-hidden-layer feedforward neural network, where the ELM technique is used instead of conventional gradient-descent-based learning algorithms.

iv) SANN [11]: In this method, two conventional ANN classifiers are connected sequentially; the first stage performs a preselection on the data set to find the two most probable classes for each instance, and the second ANN classifier applies a binary classification to the selected classes.

Table III summarizes the classification accuracy of the ENB and ENN classifiers compared to the classification algorithms introduced in [10]–[13], on a separate test set containing 54 test samples. It is evident that the ENB and ENN classifiers have considerably better classification accuracy than the other algorithms in the field of multicategory microarray cancer classification based on the fourteen cancer data set.

Now, we are ready to investigate the robustness of the ENB and ENN schemes for different numbers of features. As previously mentioned, an increase in the number of features improves the classification accuracy,


[Figure: x-axis: Number of Features (0–350); y-axis: Classification Accuracy (0.4–1); curves: ENN, ENB, SVM-OVA.]

Fig. 5. Classification accuracy of the ensemble-based k-nearest neighbor (ENN), the ensemble-based naive Bayes (ENB), and the SVM-OVA classifiers versus different numbers of features selected by the MSVM-RFE.

but increasing the number of features causes two significant problems: i) overfitting on the training data, and hence performance degradation when too many features are used, and ii) increased computational complexity. In this regard, we are interested in finding the fewest features that best discriminate the samples of different classes. Toward this goal, we investigate the performance of the aforementioned schemes in terms of different numbers of features. Fig. 5 demonstrates the accuracy of the ENB and ENN classifiers along with the SVM-OVA scheme for different numbers of features selected by the MSVM recursive feature elimination block. Simulations are performed over twenty base classifiers in the ENB and the ENN schemes, while the SVM-OVA classifier is simulated using the LibSVM package, in which the linear kernel is best suited; complicated kernels such as the RBF kernel degrade the performance significantly due to the overfitting problem. It is seen from Fig. 5 that the ENB and the ENN schemes are much more stable against the feature elimination than the SVM-OVA method. The ENB classifier has significantly better performance for a low number of features (about fifty features), which is one of the attractive aspects of the scheme, and it is the only one whose accuracy is reduced only slightly by decreasing the number of features.
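As a simplified, single-SVM stand-in for the MSVM-RFE block (the scheme of [18] aggregates feature rankings from multiple SVMs trained on resampled data), recursive feature elimination down to fifty features could be sketched with scikit-learn; the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# High-dimensional toy data standing in for a microarray set.
X, y = make_classification(n_samples=120, n_features=500, n_informative=20,
                           n_classes=3, random_state=0)

# Recursively drop the 10% lowest-weighted features until 50 remain.
selector = RFE(LinearSVC(max_iter=5000), n_features_to_select=50, step=0.1)
X_reduced = selector.fit_transform(X, y)
```

Each elimination round refits the linear SVM and discards the features with the smallest weight magnitudes, so the surviving fifty features are those the classifier relies on most.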

V. CONCLUSION

In this paper, we have investigated two ensemble-based methods, namely the ENN and the ENB, for the practical multicategory microarray cancer data sets in [10]. Simulation results show that our classifiers have considerably better accuracy than some conventional classification techniques such as the SVM-OVA in [10] and the ECART in [14]. Both the ENN and ENB classifiers can handle the whole data set without utilizing feature selection techniques. To reduce the run time of the classification process, the MSVM-RFE technique is used, resulting in classifiers well suited for the multicategory

microarray cancer data sets. The accuracy of the ENB and the ENN classifiers fluctuates only slightly when the number of features is reduced; in particular, the ENB method displays much better performance than the SVM-OVA and the ENN schemes for small feature sets.

REFERENCES

[1] S. Ghorai, A. Mukherjee, S. Sengupta, and P. K. Dutta, "Cancer classification from gene expression data by NPPC ensemble," IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 659–671, May/June 2011.

[2] H. Zhang, H. Wang, Z. Dai, M. Chen, and Z. Yuan, "Improving accuracy for cancer classification with a new algorithm for genes selection," BMC Bioinformatics, vol. 13, no. 298, Nov. 2012.

[3] U. R. Muller and M. Nicolau, Microarray Technology and its Applications, Berlin: Springer, 2004.

[4] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, "Tissue classification with gene expression profiles," Journal of Computational Biology, vol. 4, no. 3-4, pp. 559–583, 2000.

[5] L.-T. Huang, "An integrated method for cancer classification and rule extraction from microarray data," Journal of Biomedical Science, vol. 16, no. 25, Feb. 2009.

[6] B.-C. Chien, C.-R. Lin, P.-T. Hou, and R.-M. Chen, "Evolving ensemble classifiers for incremental face recognition," in Proc. IEEE International Conference on Machine Learning and Cybernetics (ICMLC), July 2012, vol. 4, pp. 1559–1564.

[7] K. Woods, W. P. Kegelmeyer, Jr., and K. Bowyer, "Combination of multiple classifiers using local accuracy estimates," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, April 1997.

[8] K.-J. Kim and S.-B. Cho, "Ensemble classifiers based on correlation analysis for DNA microarray classification," Neurocomputing, vol. 70, no. 1-3, pp. 187–199, June 2006.

[9] H. Yu and S. Xu, "Simple rule-based ensemble classifiers for cancer DNA microarray data classification," in Proc. Conference on Computer Science and Service System (CSSS), June 2011, pp. 2555–2558.

[10] S. Ramaswamy et al., "Multiclass cancer diagnosis using tumor gene expression signatures," Proceedings of the National Academy of Sciences of the United States of America (PNAS), vol. 98, no. 26, pp. 15149–15154, Dec. 2002.

[11] R. Linder, D. Dew, H. Sudhoff, D. Theegarten, K. Remberger, S. J. Poppel, and M. Wagner, "The subsequent artificial neural network (SANN) approach might bring more classificatory power to ANN-based DNA microarray analyses," Bioinformatics, vol. 20, no. 18, pp. 3544–3552, 2004.

[12] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, "Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis," IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485–495, July 2007.

[13] A. Z. Shabgahi and M. S. Abadeh, "A fuzzy classification system based on memetic algorithm for cancer disease diagnosis," in Proc. 18th IEEE Iranian Conference of Biomedical Engineering (ICBME), Dec. 2011.

[14] S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, no. 457, pp. 77–87, March 2002.

[15] M. Dettling and P. Buhlmann, "Boosting for tumor classification with gene expression data," Bioinformatics, vol. 19, pp. 1061–1069, 2003.

[16] S. Zhu, D. Wang, K. Yu, T. Li, and Y. Gong, "Feature selection for gene expression using model-based entropy," IEEE/ACM Trans. on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 25–36, Jan.-March 2010.

[17] K. Tian, L. Jing, and N. Du, "Sparse representation-based gene selection for cancer prediction," in Proc. International Conference on Biomedical Engineering and Informatics (BMEI), Oct. 2011, vol. 4, pp. 1789–1793.

[18] K.-B. Duan, J. C. Rajapakse, H. Wang, and F. Azuaje, "Multiple SVM-RFE for gene selection in cancer classification with expression data," IEEE Trans. on Nanobioscience, vol. 4, no. 3, pp. 228–234, Sept. 2005.

[19] T. M. Mitchell, Machine Learning, McGraw Hill, 2005.

[20] E. K. Tang, P. N. Suganthan, and X. Yao, "Gene selection algorithms for microarray data based on least squares support vector machine," BMC Bioinformatics, vol. 7, no. 95, Feb. 2006.

[21] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119–139, 1997.