Research Article

Integrated Use of Statistical-Based Approaches and Computational Intelligence Techniques for Tumors Classification Using Microarray

Hindawi Publishing Corporation, Discrete Dynamics in Nature and Society, Volume 2015, Article ID 261013, 8 pages, http://dx.doi.org/10.1155/2015/261013

Chia-Ding Hou and Yuehjen E. Shao

Department of Statistics and Information, Fu Jen Catholic University, New Taipei City 24205, Taiwan

Correspondence should be addressed to Yuehjen E. Shao; stat1003@mail.fju.edu.tw

Received 16 February 2015; Revised 1 April 2015; Accepted 9 April 2015

Academic Editor: Miguel Ángel López

Copyright © 2015 C.-D. Hou and Y. E. Shao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the recent development of biotechnologies, cDNA microarray chips are increasingly applied in cancer research. Microarray experiments can lead to a more thorough grasp of the molecular variations among tumors because they allow the monitoring of expression levels in cells for thousands of genes simultaneously. Accordingly, how to successfully discriminate the classes of tumors using gene expression data is an urgent research issue and plays an important role in carcinogenesis. To refine the large dimension of the gene data and effectively classify tumor classes, this study proposes several hybrid discrimination procedures that combine statistical-based techniques and computational intelligence approaches to discriminate the tumor classes. A real microarray data set was used to demonstrate the performance of the proposed approaches. In addition, the results of cross-validation experiments reveal that the proposed two-stage hybrid models are more efficient in discriminating the acute leukemia classes than the established single-stage models.

1. Introduction

The recent development of cDNA microarray technologies has made it possible to analyze thousands of genes simultaneously and has led to the prospect of providing an accurate and efficient means for classifying and diagnosing human cancers [1-20]. Advances in microarray discrimination methods promise to greatly advance cancer diagnosis, especially in situations where tumors are clinically atypical. The main challenge of microarray analysis, however, is the overwhelming number of genes compared to the smaller number of available tumor samples, that is, a very large number of variables relative to the number of observations [10, 21-23]. As a consequence, the issue of developing an accurate discrimination method for tumor classification using gene expression data has received considerable attention recently.

Many approaches have been proposed for tumor classification using microarray data [10, 22-33]. The existing methods can be divided into two types, the statistical-based methods [10, 22, 24-26] and the computational intelligence methods [22, 27-33]. Because the dimension of the gene data is very large but only a few observations are available, the whole data set must be reduced and refined before the classification task is performed. While most related works have focused on the use of a single technique for tumor classification, little research has been done on the integrated use of several techniques simultaneously to classify tumor classes. To achieve high accuracy for a particular classification problem with smaller computational time, hybrid evolutionary computation algorithms are commonly used for optimizing the resolution process [34-36]. As a consequence, in this study, we aim to develop several effective two-stage hybrid discrimination approaches that integrate the framework of statistical methods and computational intelligence methods for tumor classification based on gene expression data.

The remainder of this paper is structured as follows. The second section reviews several existing approaches considered in our comparison study. The third section addresses the proposed hybrid approaches for tumors classification.




The fourth section shows classification results from the cross-validation. The final section reports the research findings and presents a conclusion to complete this study.

2. Review of Established Methods

Consider a two-class classification problem. Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ be the gene expression profile vector, where $x_{ij}$ is the expression level of the $j$th gene in the $i$th tumor sample, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, r$. Let $Y_i$ be a binary disease status variable (1 for case group $\pi_1$ and $-1$ for control group $\pi_2$, as a general example). Accordingly, the microarray data may be summarized as the following set:

$$\{(Y_1, x_1), (Y_2, x_2), \ldots, (Y_n, x_n)\}. \quad (1)$$

The following sections briefly review several well-known, established microarray classification methods.

2.1. Fisher's Linear Discriminant Analysis. With the use of gene expression data, several studies proposed to apply Fisher's linear discriminant analysis (FLDA) to classify and diagnose cancer [10, 22, 24]. Assume that independent observation vectors $z_{11}, z_{12}, \ldots, z_{1n_1}$ and $z_{21}, z_{22}, \ldots, z_{2n_2}$ are obtained from the two known groups $\pi_1$ and $\pi_2$, respectively. Let

$$\bar{z}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} z_{1i}, \qquad \bar{z}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} z_{2i},$$

$$S = \frac{1}{n}\left[\sum_{i=1}^{n_1}(z_{1i}-\bar{z}_1)(z_{1i}-\bar{z}_1)' + \sum_{i=1}^{n_2}(z_{2i}-\bar{z}_2)(z_{2i}-\bar{z}_2)'\right], \quad (2)$$

where

$$n = n_1 + n_2 - 2. \quad (3)$$

To classify a new observation $z_0$, we can utilize the following FLDA allocation rule:

$$(\bar{z}_1-\bar{z}_2)'S^{-1}z_0 - \tfrac{1}{2}(\bar{z}_1-\bar{z}_2)'S^{-1}(\bar{z}_1+\bar{z}_2) \ge 0: \ \text{allocate } z_0 \text{ to group } \pi_1,$$

$$(\bar{z}_1-\bar{z}_2)'S^{-1}z_0 - \tfrac{1}{2}(\bar{z}_1-\bar{z}_2)'S^{-1}(\bar{z}_1+\bar{z}_2) < 0: \ \text{allocate } z_0 \text{ to group } \pi_2. \quad (4)$$
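As an illustration, the allocation rule in (2)-(4) can be written directly in a few lines of R. This is a minimal sketch; the data frame train (with the class label y coded 1/-1 in its first column and the gene expression values in the remaining columns) is a hypothetical object, not one defined in the paper.

z1 <- as.matrix(train[train$y ==  1, -1])    # observations from group pi_1
z2 <- as.matrix(train[train$y == -1, -1])    # observations from group pi_2
zbar1 <- colMeans(z1); zbar2 <- colMeans(z2)
n1 <- nrow(z1); n2 <- nrow(z2)
## pooled matrix S of (2), using the divisor n = n1 + n2 - 2 of (3)
S <- ((n1 - 1) * cov(z1) + (n2 - 1) * cov(z2)) / (n1 + n2 - 2)
a <- solve(S, zbar1 - zbar2)                 # S^{-1}(zbar1 - zbar2)
## rule (4): allocate a new profile z0 to pi_1 when the discriminant score is >= 0
flda_allocate <- function(z0) {
  score <- sum(a * z0) - 0.5 * sum(a * (zbar1 + zbar2))
  if (score >= 0) 1 else -1
}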

2.2. Logistic Regression. The microarray discrimination approach with the use of the logistic regression (LR) model was also studied for disease classification [22, 25, 26]. The structure of the logistic regression model can be briefly described as follows. Let

$$P_i = \Pr\left[Y_i = 1 \mid x_{i1}, x_{i2}, \ldots, x_{ir}\right], \quad i = 1, 2, \ldots, n, \quad (5)$$

be the conditional probability of the event $Y_i = 1$ given the independent variables $(x_{i1}, x_{i2}, \ldots, x_{ir})$. The logistic regression model is then defined as follows:

$$\ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij}. \quad (6)$$

A collinearity diagnosis procedure should be conducted first to exclude variables exhibiting high collinearity. After the collinearity diagnosis, the remaining variables are used for logistic regression modeling and testing. Afterward, using logistic regression with the Wald-forward method, we can identify significant independent variables, say $x^*_{i1}, x^*_{i2}, \ldots, x^*_{ik}$, and obtain a significant model

$$\hat{P}_i = \frac{\exp\left(\beta_0 + \sum_{j=1}^{k}\beta_j x^*_{ij}\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{k}\beta_j x^*_{ij}\right)}, \quad i = 1, 2, \ldots, n. \quad (7)$$
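A hedged R sketch of this screening step follows. The data frame train (a binary y coded 0/1 plus the candidate genes) is hypothetical, and step() performs AIC-based forward selection, which is used here only as a stand-in for the Wald-forward option described above.

null_fit <- glm(y ~ 1, family = binomial, data = train)
full_fit <- glm(y ~ ., family = binomial, data = train)
## forward selection from the null model toward the full model
lr_fit <- step(null_fit, scope = formula(terms(full_fit)),
               direction = "forward", trace = 0)
summary(lr_fit)                              # Wald z-statistics of the retained genes
p_hat <- predict(lr_fit, type = "response")  # fitted probabilities, cf. (7)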

2.3. Artificial Neural Network. Based on gene expression profiles, the artificial neural network (ANN) has also been used to discriminate the tumor classes [22, 27-29]. The ANN framework includes the input, the output, and the hidden layers. The nodes in the input layer receive input signals from an external source, and the nodes in the output layer provide the target output signals. For each neuron $j$ in the hidden layer and neuron $k$ in the output layer, the net inputs are given by

$$\text{net}_j = \sum_{i} w_{ji} \times o_i, \qquad \text{net}_k = \sum_{j} w_{kj} \times o_j, \quad (8)$$

where $i$ ($j$) is a neuron in the previous layer, $w_{ji}$ ($w_{kj}$) is the connection weight from neuron $i$ ($j$) to neuron $j$ ($k$), and $o_i$ ($o_j$) is the output of node $i$ ($j$). The sigmoid functions are given by

$$o_i = \text{net}_i, \qquad o_j = \frac{1}{1 + \exp\left[-(\text{net}_j + \theta_j)\right]} = f_j(\text{net}_j, \theta_j), \qquad o_k = \frac{1}{1 + \exp\left[-(\text{net}_k + \theta_k)\right]} = f_k(\text{net}_k, \theta_k), \quad (9)$$

where $\text{net}_i$ is the input signal from the external source to node $i$ in the input layer and $\theta_j$ ($\theta_k$) is a bias. The conventional technique used to derive the connection weights of the feedforward network is the generalized delta rule [37].
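The net inputs and sigmoid outputs in (8)-(9) translate directly into R. In this minimal sketch the weight matrices W_h, W_o and bias vectors theta_h, theta_o are hypothetical placeholders, not quantities reported in the paper.

sigmoid <- function(u) 1 / (1 + exp(-u))
forward <- function(x, W_h, theta_h, W_o, theta_o) {
  net_j <- as.vector(W_h %*% x)        # net input of each hidden neuron, (8)
  o_j   <- sigmoid(net_j + theta_h)    # hidden-layer outputs, (9)
  net_k <- as.vector(W_o %*% o_j)      # net input of each output neuron, (8)
  sigmoid(net_k + theta_o)             # output-layer signals, (9)
}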


2.4. Support Vector Machine. To classify tumor classes using microarray data, the discrimination method with the use of the support vector machine (SVM) has also been discussed [22, 30-33]. The structure of the SVM algorithm can be described as follows. Let $\{(y_i, x_i)\}_{i=1}^{n}$, $x_i \in R^r$, $y_i \in \{-1, 1\}$, be the training set of input vectors and labels, where $n$ is the number of sample observations, $r$ is the dimension of each observation, and $y_i$ is the known target. The algorithm seeks the hyperplane $w' \cdot x_i + b = 0$, where $w$ is the normal vector of the hyperplane and $b$ is a bias term, that separates the data of the two classes with maximal margin width $2/\|w\|$. In order to obtain the optimal hyperplane, the SVM solves the following optimization problem:

$$\min\ \Phi(w) = \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ y_i(w' x_i + b) \ge 1,\ i = 1, 2, \ldots, n. \quad (10)$$

Because it is difficult to solve (10) directly, the SVM transforms the optimization problem into its dual problem by the Lagrange method. The Lagrange multipliers $\alpha_i$ must be nonnegative real coefficients. Equation (10) is transformed into the following constrained form [38]:

$$\max\ \Phi(w, b, \xi, \alpha, \beta) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^{T} x_j \quad \text{s.t.}\ \sum_{j=1}^{n}\alpha_j y_j = 0,\ 0 \le \alpha_i \le C,\ i = 1, 2, \ldots, n. \quad (11)$$

In (11), $C$ is the penalty factor and determines the degree of penalty assigned to an error. Typically, a linear separating hyperplane cannot be found for all application data. For problems that cannot be linearly separated in the input space, the SVM employs the kernel method to transform the original input space into a high-dimensional feature space, where an optimal linear separating hyperplane can be found. The common kernel functions are the linear, polynomial, radial basis function (RBF), and sigmoid kernels. Although several choices for the kernel function are available, the most widely used kernel function is the RBF, which is defined as [39]

$$K(x_i, x_j) = \exp\left(-\gamma\|x_i - x_j\|^2\right), \quad \gamma \ge 0, \quad (12)$$

where $\gamma$ denotes the width of the RBF. Consequently, the RBF kernel and the multiclass SVM method of [40] are used in this study.
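For reference, the RBF kernel in (12) is a one-line function in R; the default gamma = 0.5 below simply mirrors the value tuned later in Algorithm 1 and is not a recommendation of the paper.

rbf_kernel <- function(x_i, x_j, gamma = 0.5) exp(-gamma * sum((x_i - x_j)^2))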

2.5. Multivariate Adaptive Regression Splines. Multivariate adaptive regression splines (MARS) have also been applied for tumor classification using gene expression data [22, 30]. The general MARS function can be represented as follows:

$$f(x) = b_0 + \sum_{m=1}^{M} b_m \prod_{k=1}^{K_m}\left[S_{km}\left(x_{v(k,m)} - t_{km}\right)\right], \quad (13)$$

where $b_0$ and $b_m$ are the parameters, $M$ is the number of basis functions (BF), $K_m$ is the number of knots, $S_{km}$ takes on values of either 1 or $-1$ and indicates the right or left sense of the associated step function, $v(k,m)$ is the label of the independent variable, and $t_{km}$ is the knot location. The optimal MARS model is chosen in a two-step procedure. First, a large number of basis functions are constructed to fit the data initially. Second, basis functions are deleted in order of least contribution using the generalized cross-validation (GCV) criterion. To measure the importance of a variable, we can observe the decrease in the calculated GCV value when the variable is removed from the model. The GCV is defined as follows:

$$\text{LOF}(\hat{f}_M) = \text{GCV}(M) = \frac{(1/n)\sum_{i=1}^{n}\left[y_i - \hat{f}_M(x_i)\right]^2}{\left[1 - C(M)/n\right]^2}, \quad (14)$$

where $n$ is the number of observations and $C(M)$ is the cost penalty measure of a model containing $M$ basis functions.
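A short MARS sketch follows, using the earth package as an assumed implementation (the paper does not name its MARS software); train is again a hypothetical data frame with a binary y and the screened genes.

library(earth)
mars_fit <- earth(y ~ ., data = train, degree = 1)   # additive basis functions
summary(mars_fit)   # basis functions retained after GCV-based backward pruning
evimp(mars_fit)     # relative variable importance, cf. Table 3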

3. The Proposed Hybrid Discrimination Methods

The two-stage hybrid procedure is commonly used in various fields, such as financial distress warning systems [41, 42], the medical area [43], statistical inference [44, 45], and statistical process control [36, 46-48]. To obtain the best accuracy for a specific classification problem, hybrid evolutionary computation algorithms are commonly used to optimize the resolution process [34-36]. In this section, several two-stage hybrid discrimination methods that integrate the framework of statistical-based approaches and computational intelligence methods are proposed for tumor classification based on gene expression microarray data.

The proposed methods include five components: the FLDA, the LR model, the MARS model, the ANN, and the SVM classifiers. The proposed hybrid discrimination methods combine the statistical-based discrimination methods and the computational intelligence methods. In stage 1, influencing variables are selected using LR or MARS. In stage 2, the selected important influencing variables are taken as the input variables of FLDA, LR, ANN, SVM, or MARS. The following sections address the proposed approaches.

3.1. Two-Stage Hybrid Method of LR and Various Classifiers

Stage 1. Substitute the independent variables $x_{i1}, x_{i2}, \ldots, x_{ir}$ and the dependent variable $y_i$ into logistic regression. Apply logistic regression with the Wald-forward method to identify significant independent variables, say $x^*_{i1}, x^*_{i2}, \ldots, x^*_{ik}$.

Stage 2. Substitute the significant independent variables $x^*_{i1}, x^*_{i2}, \ldots, x^*_{ik}$ obtained in Stage 1 and the dependent variable $y_i$ into various classifiers, such as FLDA, ANN, SVM, or MARS. The corresponding hybrid methods are referred to as LR-FLDA, LR-ANN, LR-SVM, and LR-MARS, respectively.
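The following sketch strings the two stages together for the LR-SVM variant, reusing the step()-based selection shown in Section 2.2 as a stand-in for the Wald-forward procedure; train, test, and the e1071 settings (gamma = 0.5, cost = 2, matching Algorithm 1) are assumptions rather than fixed choices of the paper.

library(e1071)
## Stage 1: genes retained by the forward logistic selection (lr_fit of Section 2.2)
sel_genes <- attr(terms(lr_fit), "term.labels")
## Stage 2: RBF support vector machine trained on the selected genes only
svm_fit <- svm(factor(y) ~ ., data = train[, c("y", sel_genes)],
               type = "C-classification", kernel = "radial",
               gamma = 0.5, cost = 2)
pred <- predict(svm_fit, newdata = test[, sel_genes])
mean(as.character(pred) == as.character(test$y))   # test-set ACR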


Table 1: The influencing genes selected by using the two-sample t-test with a significance level of 0.0001 (mean and SD are reported for the AML and ALL groups).

x1: CMKBR7 chemokine (C-C) receptor 7. AML mean 6812, SD 14555; ALL mean -4802, SD 8269; P value 0.00005.
x2: LAMP2 lysosome-associated membrane protein 2, alternative products. AML mean 17160, SD 11380; ALL mean 6221, SD 8022; P value 0.00001.
x3: Quiescin (Q6) mRNA, partial cds. AML mean 153492, SD 107094; ALL mean 71577, SD 55919; P value 0.00006.
x4: Peptidyl-prolyl cis-trans isomerase, mitochondrial precursor. AML mean 29944, SD 27060; ALL mean 3464, SD 11062; P value 0.00006.
x5: Transmembrane protein mRNA. AML mean 9004, SD 7282; ALL mean 1898, SD 5912; P value 0.00003.
x6: PGD phosphogluconate dehydrogenase. AML mean 97052, SD 62100; ALL mean 48091, SD 31341; P value 0.00002.
x7: Canalicular multispecific organic anion transporter (cMOAT). AML mean 4256, SD 6305; ALL mean 13185, SD 8413; P value 0.00006.
x8: Huntingtin interacting protein (HIP1) mRNA. AML mean -940, SD 12107; ALL mean -13609, SD 12005; P value 0.00000.
x9: ME491 gene extracted from H. sapiens gene for Me491/CD63 antigen. AML mean 202680, SD 165810; ALL mean 74770, SD 54840; P value 0.00001.
x10: GB DEF = nonmuscle myosin heavy chain-B (MYH10) mRNA, partial cds. AML mean 24312, SD 10991; ALL mean 48681, SD 33069; P value 0.00002.
x11: P4HB procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), beta polypeptide. AML mean 201560, SD 138483; ALL mean 101583, SD 50340; P value 0.00003.

3.2. Two-Stage Hybrid Method of MARS and Various Classifiers

Stage 1. Substitute the independent variables $x_{i1}, x_{i2}, \ldots, x_{ir}$ and the dependent variable $y_i$ into multivariate adaptive regression splines. Use multivariate adaptive regression splines to identify significant independent variables, say $x^*_{i1}, x^*_{i2}, \ldots, x^*_{ik}$.

Stage 2. Substitute the significant independent variables $x^*_{i1}, x^*_{i2}, \ldots, x^*_{ik}$ obtained in Stage 1 and the dependent variable $y_i$ into various classifiers, such as FLDA, LR, ANN, or SVM. The corresponding hybrid methods are referred to as MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, respectively.
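A companion sketch for the MARS-FLDA variant feeds the variables retained by the earth fit of Section 2.5 into Fisher's LDA via MASS::lda; the object names (mars_fit, train, test) are hypothetical.

library(MASS)
sel_genes <- rownames(evimp(mars_fit))   # genes used by the MARS model, cf. Table 3
lda_fit <- lda(factor(y) ~ ., data = train[, c("y", sel_genes)])
pred <- predict(lda_fit, newdata = test[, sel_genes])$class
mean(as.character(pred) == as.character(test$y))   # test-set ACR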

4. The Cross-Validation Experiments

This study performs a series of cross-validation experiments to compare the proposed approaches with those previously discussed in the literature. This study considers a leukemia dataset that was first described by Golub et al. [5] and was examined in Dudoit et al. [10] and Lee et al. [22]. The dataset contains 6817 human genes and was obtained from Affymetrix high-density oligonucleotide microarrays. The data consist of 25 cases of acute myeloid leukemia (AML) and 47 cases of acute lymphoblastic leukemia (ALL).

Since the dimension of the data is very large ($r = 6817$) but there are only a few observations ($n = 72$), it is essential to reduce and refine the whole set of genes (independent variables) before we can construct the discrimination model.

Table 2: Collinearity diagnosis for LR modeling.

Variable  x1    x2    x3    x4    x5    x6    x7    x8    x9    x10   x11
VIF       2.09  2.97  2.76  2.56  1.61  3.46  1.33  1.88  6.85  1.58  5.47

To refine the set of genes, Golub et al. [5], Dudoit et al. [10], and Lee et al. [22] proposed methods of subjective ratios to select genes. It is well known that the two-sample t-test is the most popular test for a difference in means between two groups. For the sake of strictness, instead of using a somewhat arbitrary criterion like that used in Golub et al. [5], Dudoit et al. [10], or Lee et al. [22], this study applies the two-sample t-test with a significance level of 0.0001 to select the influencing genes. The results are given in Table 1.
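A minimal sketch of this screening step, assuming the 72 x 6817 expression matrix is stored in an object expr with the AML/ALL labels in class_lab (both hypothetical): each gene is tested with a two-sample t-test (Welch's version, the R default) and kept when its p value falls below 0.0001.

p_vals <- apply(expr, 2, function(g) t.test(g ~ class_lab)$p.value)
sel <- which(p_vals < 0.0001)                   # the 11 influencing genes of Table 1
train_df <- data.frame(y = as.numeric(class_lab == "AML"), expr[, sel])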

The significant variables selected using the two-sample t-test then serve as the input variables of the established single-stage discrimination methods reviewed in Section 2 and the proposed two-stage hybrid methods introduced in Section 3. To examine the presence of collinearity, the variance inflation factor (VIF) was calculated. As shown in Table 2, all the VIF values are less than 10; consequently, there is no high collinearity among these variables. In addition, this study adopts the suggestions of Dudoit et al. [10] and Lee et al. [22] and performs a 2:1 cross-validation (training set : test set).
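The VIF check of Table 2 can be reproduced along these lines, assuming the car package; train_df is the hypothetical data frame built from the 11 screened genes above.

library(car)
vif(glm(y ~ ., family = binomial, data = train_df))   # all values below 10 in Table 2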

The difficulty with the ANN is that design parameters, such as the number of hidden layers and the number of neurons in each layer, have to be set before the training process can proceed. The user has to select the ANN structure and set the values of certain parameters for the ANN modeling process.


> # Find the best parameters gamma & cost
> p <- seq(-1, 1, 1)
> obj <- tune.svm(y ~ ., data = train, sampling = "cross", gamma = 2^(p), cost = 2^(p))
> obj
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
  gamma  cost
    0.5     2

> # Building the SVM model
> svm.model <- svm(y ~ ., data = train, type = "C-classification",
+                  gamma = obj$best.parameters[[1]], cost = obj$best.parameters[[2]])
> # Classification capability: Train
> svm.pred <- predict(svm.model, train)
> tab <- table(predict = svm.pred, true = train[, 1])
> tab
        true
predict   0   1
      0  17   0
      1   0  31
> cat("Accurate Classification Rate =", 100 * sum(diag(tab)) / sum(tab), "\n")
Accurate Classification Rate = 100
> # Classification capability: Test
> svm.pred <- predict(svm.model, test)
> tab <- table(predict = svm.pred, true = test[, 1])
> tab
        true
predict   0   1
      0   2   1
      1   6  15
> cat("Accurate Classification Rate =", 100 * sum(diag(tab)) / sum(tab), "\n")
Accurate Classification Rate = 70.83333

Algorithm 1: The SVM modeling output.

However, there is no general and explicit approach to select optimal parameters for the ANN models [49]. Accordingly, the selection of design parameters for the ANN may be based on a trial-and-error procedure.

This study employs the highest accurate classification rate (ACR) as the criterion for selecting the ANN topology. The topology is denoted $n_i$-$n_h$-$n_o$-$L$, which stands for the number of neurons in the input layer, the number of neurons in the hidden layer, the number of neurons in the output layer, and the learning rate, respectively. Too few hidden nodes would limit the network's generalization capability, while too many hidden nodes may result in overtraining or memorization by the network. Since there are 11 input nodes and one output node used in this study, the numbers of hidden nodes tested were 9, 10, 11, 12, and 13, and the learning rates were chosen as 0.1, 0.01, or 0.001, respectively. After performing the ANN modeling, this study found that the 11-9-1-0.01 topology has the best ACR results.
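A rough sketch of this trial-and-error search using the nnet package (an assumed substitute for the paper's back-propagation software) follows; nnet exposes a weight-decay penalty rather than a learning rate, so decay is used here only as the tuning analogue, and train/test are hypothetical splits.

library(nnet)
grid <- expand.grid(size = 9:13, decay = c(0.1, 0.01, 0.001))
acr <- apply(grid, 1, function(g) {
  fit <- nnet(factor(y) ~ ., data = train, size = g["size"], decay = g["decay"],
              maxit = 500, trace = FALSE)
  mean(predict(fit, newdata = test, type = "class") == as.character(test$y))
})
grid[which.max(acr), ]   # topology with the highest accurate classification rate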

This study also performed SVM modeling on the microarray dataset. The two parameters $C$ and $\gamma$ are the most important factors affecting the performance of the SVM. The grid search method uses exponentially growing sequences of $C$ and $\gamma$ to determine good parameters.

Table 3: The relative importance of four explanatory variables for MARS modeling.

Function  Variable  Relative importance (%)
1         x2        100.0
2         x7        72.0
3         x8        42.7
4         x6        26.3

The parameter set of $C$ and $\gamma$ that generates the highest ACR is considered to be the ideal set. Here, the best two parameter values for $C$ and $\gamma$ are 2 and 0.5, respectively. The SVM package was run on the dataset, and the corresponding output is displayed in Algorithm 1. As shown in Algorithm 1, in the case of $C = 2$ and $\gamma = 0.5$, we obtain ACR = 100% for the initial training stage. In the testing stage, using the same parameter settings (i.e., $C = 2$ and $\gamma = 0.5$), we obtain ACR = 25% and ACR = 93.75% for AML and ALL, respectively. Accordingly, the ACR is 70.83% for the full sample.

For MARS modeling, the results are displayed in Table 3. During the selection process, four important explanatory variables were chosen; the corresponding relative importance indicators are shown in Table 3.


Table 4: ACRs (%) for the thirteen approaches using cross-validation.

Method       AML    ALL    Full sample
Single stage:
FLDA         37.50  93.75  75.00
LR           62.50  87.50  79.17
ANN          50.00  93.75  79.17
SVM          25.00  93.75  70.83
MARS         50.00  75.00  66.67
Two-stage:
LR-FLDA      62.50  81.25  75.00
LR-ANN       50.00  93.75  79.17
LR-SVM       75.00  81.25  79.17
LR-MARS      62.50  93.75  83.33
MARS-FLDA    75.00  75.00  75.00
MARS-LR      75.00  75.00  75.00
MARS-ANN     37.50  93.75  75.00
MARS-SVM     62.50  87.50  79.17

As a consequence, those four important variables serve as the input variables for the hybrid modeling process. In addition, the resulting ACRs for each model are listed in Table 4.

The rationale behind the proposed hybrid discrimination methods is to obtain fewer but more informative variables by performing the first-stage LR or MARS modeling. The selected significant variables then serve as the inputs for the second-stage discrimination approach. In this study, the significant variables selected by the LR and MARS modeling are variables $x_1$, $x_2$, $x_7$, and $x_8$ and variables $x_2$, $x_6$, $x_7$, and $x_8$, respectively. For the hybrid LR-ANN model, the 4-6-1-0.01 topology provided the best ACR results; for the MARS-ANN hybrid model, the 4-6-1-0.01 topology also gave the best ACR results. Additionally, for both the LR-SVM and MARS-SVM modeling, the best two parameter values for $C$ and $\gamma$ are the same, namely 2 and 0.5, respectively.

For each of the thirteen different approaches, FLDA, LR, ANN, SVM, MARS, LR-FLDA, LR-ANN, LR-SVM, LR-MARS, MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, this study presents the corresponding ACRs in Table 4. Comparing the ACRs for AML, the LR has the highest ACR (i.e., 62.50%) among the 5 single-stage methods, while both LR-SVM and MARS-LR have the highest ACR (i.e., 75.00%) among the 8 two-stage methods. Apparently, the two-stage methods provide better classification performance. Comparing the ACRs for ALL, the single-stage methods FLDA, ANN, and SVM give the highest ACR (i.e., 93.75%), and the two-stage methods LR-ANN, LR-MARS, and MARS-ANN attain the same ACR (i.e., 93.75%); the single-stage and two-stage methods achieve a similar performance here. As shown in Table 4, among the thirteen methods mentioned above, the two-stage hybrid model LR-MARS has the highest ACR (i.e., 83.33%) for the full sample. As a consequence, the proposed two-stage hybrid approaches are more efficient for tumor classification than the established single-stage methods.

Table 5: Overall averaged ACR (%) and the associated standard error (in parentheses) for the single-stage and two-stage methods.

Method        AML            ALL           Full sample
Single stage  45.00 (14.25)  88.75 (8.15)  74.17 (5.43)
Two-stage     62.50 (13.36)  85.16 (8.14)  77.61 (3.10)

Figure 1: Improvement of the proposed approach in comparison with the single-stage method (bar chart of improvement (%) for AML, ALL, and the full sample).

In addition, Table 5 lists the overall averaged ACRs and the associated standard errors (in parentheses) for the single-stage and two-stage methods. Comparing the single-stage and the proposed two-stage methods in Table 5, one can observe that the proposed methods almost always provide more accurate results than the single-stage methods. Although the single-stage methods have a larger averaged ACR value than the two-stage methods in classifying ALL, the difference is not large. Moreover, observing Table 5, the proposed two-stage approaches have smaller standard errors for all cases, which implies the robustness of the mechanisms. Figure 1 provides a comparison with respect to the overall improvement percentage over the single-stage methods. From Figure 1, it can be seen that the two-stage approaches are more robust than the single-stage methods.

5. Conclusions

This study proposes several two-stage hybrid discrimination approaches for tumor classification using microarray data. The proposed approaches integrate the framework of several frequently used statistical-based discrimination methods and computational intelligence classification techniques. Based on the cross-validation results in Table 4, it can be easily observed that the proposed hybrid method LR-MARS is more appropriate for discriminating the tumor classes.

Computational intelligence methodology is very useful in many areas of application and can deal with complex and computationally intensive problems. With the use of several computational intelligence techniques, this study develops


two-stage hybrid discrimination approaches for tumor classification. The proposed hybrid models are not the only discrimination methods that can be employed, and further research can build on this work. For example, one can combine other computational intelligence techniques, such as rough set theory [50] or the extreme learning machine, with neural networks or support vector machines to refine the structure further and improve the classification accuracy. Extensions of the proposed two-stage hybrid discrimination method to other statistical techniques or to multistage discrimination procedures are also possible. Such work deserves further research and is our future concern.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is partially supported by the Ministry of Science and Technology of China, Grant no. MOST 103-2118-M-030-001 and Grant no. MOST 103-2221-E-030-021.

References

[1] J. L. DeRisi, V. R. Iyer, and P. O. Brown, "Exploring the metabolic and genetic control of gene expression on a genomic scale," Science, vol. 278, no. 5338, pp. 680-686, 1997.
[2] R. J. Cho, M. J. Campbell, E. A. Winzeler et al., "A genome-wide transcriptional analysis of the mitotic cell cycle," Molecular Cell, vol. 2, no. 1, pp. 65-73, 1998.
[3] S. Chu, J. DeRisi, M. Eisen et al., "The transcriptional program of sporulation in budding yeast," Science, vol. 282, no. 5389, pp. 699-705, 1998.
[4] U. Alon, N. Barka, D. A. Notterman et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745-6750, 1999.
[5] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[6] C. M. Perou, S. S. Jeffrey, M. van de Rijn et al., "Distinctive gene expression patterns in human mammary epithelial cells and breast cancers," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 16, pp. 9212-9217, 1999.
[7] J. R. Pollack, C. M. Perou, A. A. Alizadeh et al., "Genome-wide analysis of DNA copy-number changes using cDNA microarrays," Nature Genetics, vol. 23, no. 1, pp. 41-46, 1999.
[8] A. A. Alizadeh, M. B. Eisen, R. E. Davis et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000.
[9] S. Ramaswamy, P. Tamayo, R. Rifkin et al., "Multiclass cancer diagnosis using tumor gene expression signatures," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 26, pp. 15149-15154, 2001.
[10] S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, no. 457, pp. 77-87, 2002.
[11] J. J. Liu, G. Cutler, W. Li et al., "Multiclass cancer classification and biomarker discovery using GA-based algorithms," Bioinformatics, vol. 21, no. 11, pp. 2691-2697, 2005.
[12] L. Ziaei, A. R. Mehri, and M. Salehi, "Application of artificial neural networks in cancer classification and diagnosis prediction of a subtype of lymphoma based on gene expression profile," Journal of Research in Medical Sciences, vol. 11, no. 1, pp. 13-17, 2006.
[13] Z. Wang, Y. Wang, J. Xuan et al., "Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data," Bioinformatics, vol. 22, no. 6, pp. 755-761, 2006.
[14] K. V. G. Rao, P. P. Chand, and M. V. R. Murthy, "A neural network approach in medical decision systems," Journal of Theoretical and Applied Information Technology, vol. 3, pp. 97-101, 2007.
[15] H. Rattikorn and K. Phongphun, "Tumor classification ranking from microarray data," BMC Genomics, vol. 9, no. 2, article S21, 2008.
[16] L.-T. Huang, "An integrated method for cancer classification and rule extraction from microarray data," Journal of Biomedical Science, vol. 16, article 25, 10 pages, 2009.
[17] T. Y. Yang, "Efficient multi-class cancer diagnosis algorithm, using a global similarity pattern," Computational Statistics & Data Analysis, vol. 53, no. 3, pp. 756-765, 2009.
[18] H. Pang, K. Ebisu, E. Watanabe, L. Y. Sue, and T. Tong, "Analysing breast cancer microarrays from African Americans using shrinkage-based discriminant analysis," Human Genomics, vol. 5, no. 1, pp. 5-16, 2010.
[19] N. B. Dawany, W. N. Dampier, and A. Tozeren, "Large-scale integration of microarray data reveals genes and pathways common to multiple cancer types," International Journal of Cancer, vol. 128, no. 12, pp. 2881-2891, 2011.
[20] R. Pillai, R. Deeter, C. T. Rigl et al., "Validation and reproducibility of a microarray-based gene expression test for tumor identification in formalin-fixed, paraffin-embedded specimens," The Journal of Molecular Diagnostics, vol. 13, no. 1, pp. 48-56, 2011.
[21] D. Ghosh, "Penalized discriminant methods for the classification of tumors from gene expression data," Biometrics, vol. 59, no. 4, pp. 992-1000, 2003.
[22] J. W. Lee, J. B. Lee, M. Park, and S. H. Song, "An extensive comparison of recent classification tools applied to microarray data," Computational Statistics & Data Analysis, vol. 48, no. 4, pp. 869-885, 2005.
[23] M. S. Srivastava and T. Kubokawa, "Comparison of discrimination methods for high dimensional data," Journal of the Japan Statistical Society, vol. 37, no. 1, pp. 123-134, 2007.
[24] R. Bermudo, D. Abia, A. Mozos et al., "Highly sensitive molecular diagnosis of prostate cancer using surplus material washed off from biopsy needles," British Journal of Cancer, vol. 105, no. 10, pp. 1600-1607, 2011.
[25] W. Li, F. Sun, and I. Grosse, "Extreme value distribution based gene selection criteria for discriminant microarray data analysis using logistic regression," Journal of Computational Biology, vol. 11, no. 2-3, pp. 215-226, 2004.
[26] J. G. Liao and K. V. Chin, "Logistic regression for disease classification using microarray data: model selection in a large p and small n case," Bioinformatics, vol. 23, no. 15, pp. 1945-1951, 2007.
[27] S. Gruvberger, M. Ringner, Y. Chen et al., "Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns," Cancer Research, vol. 61, no. 16, pp. 5979-5984, 2001.
[28] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673-679, 2001.
[29] L. J. Lancashire, C. Lemetre, and G. R. Ball, "An introduction to artificial neural networks in bioinformatics: application to complex microarray and mass spectrometry datasets in cancer studies," Briefings in Bioinformatics, vol. 10, no. 3, pp. 315-329, 2009.
[30] M. R. Segal, K. D. Dahlquist, and B. R. Conklin, "Regression approaches for microarray data analysis," Journal of Computational Biology, vol. 10, no. 6, pp. 961-980, 2003.
[31] A. Dragomir and A. Bezerianos, "Improving gene expression sample classification using support vector machine ensembles aggregated by boosting," Cancer Genomics & Proteomics, vol. 3, no. 1, pp. 63-70, 2006.
[32] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, "Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485-494, 2007.
[33] X. Wang and R. Simon, "Microarray-based cancer prediction using single genes," BMC Bioinformatics, vol. 12, article 391, 2011.
[34] K. Y. Chan, C. K. Kwong, and Y. C. Tsim, "Modelling and optimization of fluid dispensing for electronic packaging using neural fuzzy networks and genetic algorithms," Engineering Applications of Artificial Intelligence, vol. 23, no. 1, pp. 18-26, 2010.
[35] K. Y. Chan, T. S. Dillon, and C. K. Kwong, "Modeling of a liquid epoxy molding process using a particle swarm optimization-based fuzzy regression approach," IEEE Transactions on Industrial Informatics, vol. 7, no. 1, pp. 148-158, 2011.
[36] Y. E. Shao and C.-D. Hou, "Change point determination for a multivariate process using a two-stage hybrid scheme," Applied Soft Computing, vol. 13, no. 3, pp. 1520-1527, 2013.
[37] D. E. Rumelhart and J. L. McClelland, Explorations in the Microstructure of Cognition, vol. 1, MIT Press, 1986.
[38] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.
[39] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113-126, 2004.
[40] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.
[41] S. L. Lin, C. D. Hou, and P. H. Gi, "Do the two-stage hybrid models outperform the conventional techniques? Evidence in Taiwan," International Journal of Business and Strategy, vol. 9, pp. 98-131, 2008.
[42] S. L. Lin, "A new two-stage hybrid approach of credit risk in banking industry," Expert Systems with Applications, vol. 36, no. 4, pp. 8333-8341, 2009.
[43] S.-M. Chou, T.-S. Lee, Y. E. Shao, and I.-F. Chen, "Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines," Expert Systems with Applications, vol. 27, no. 1, pp. 133-142, 2004.
[44] R. Modarres and J. L. Gastwirth, "Hybrid test for the hypothesis of symmetry," Journal of Applied Statistics, vol. 25, no. 6, pp. 777-783, 1998.
[45] R. Tang, M. Banerjee, and G. Michailidis, "A two-stage hybrid procedure for estimating an inverse regression function," The Annals of Statistics, vol. 39, no. 2, pp. 956-989, 2011.
[46] W. Bischoff and F. Miller, "A minimax two-stage procedure for comparing treatments looking at a hybrid test and estimation problem as a whole," Statistica Sinica, vol. 12, no. 4, pp. 1133-1144, 2002.
[47] C.-J. Lu, Y. E. Shao, and P.-H. Li, "Mixture control chart patterns recognition using independent component analysis and support vector machine," Neurocomputing, vol. 74, no. 11, pp. 1908-1914, 2011.
[48] Y. E. Shao and C. D. Hou, "Fault identification in industrial processes using an integrated approach of neural network and analysis of variance," Mathematical Problems in Engineering, vol. 2013, Article ID 516760, 7 pages, 2013.
[49] Z. W. Zhong, L. P. Khoo, and S. T. Han, "Prediction of surface roughness of turned surfaces using neural networks," International Journal of Advanced Manufacturing Technology, vol. 28, no. 7-8, pp. 688-693, 2006.
[50] Y. E. Shao, C.-D. Hou, and C.-C. Chiu, "Hybrid intelligent modeling schemes for heart disease classification," Applied Soft Computing Journal, vol. 14, pp. 47-52, 2014.


Page 2: Research Article Integrated Use of Statistical-Based ...downloads.hindawi.com/journals/ddns/2015/261013.pdfIntegrated Use of Statistical-Based Approaches and Computational Intelligence

2 Discrete Dynamics in Nature and Society

The fourth section shows classification results from the cross-validationThe final section reports the research findings andpresents a conclusion to complete this study

2 Review of Established Methods

Consider a two-class classification problem Let 119909

119894

= (1199091198941

1199091198942 119909119894119903) be the gene expression profile vector where 119909

119894119895is

the expression level of the 119895th gene in the 119894th tumor sample119894 = 1 2 119899 119895 = 1 2 119903 Let 119884

119894be a binary disease status

variable (1 for case group 1205871 and minus1 for control group 1205872 asa general example) Accordingly the microarray data may besummarized as the following set

(1198841 1199091) (1198842 119909

2) (119884

119899 119909

119899) (1)

The following sections briefly review several well-knownestablished microarray classification methods

21 Fisherrsquos Linear Discriminant Analysis With the use ofgene expression data several studies proposed to applyFisherrsquos linear discriminant analysis (FLDA) to classify anddiagnose cancer [10 22 24] Assume that independentobservation vectors 119911

11 119911

12 119911

11198991

and 119911

21 119911

22 119911

21198992

areobtained from the two known groups 1205871 and 1205872 respectivelyLet

119911

1=

11198991

1198991

sum

119894=1119911

1119894

119911

2=

11198992

1198992

sum

119894=1119911

2119894

119878 =

1119899

[

1198991

sum

119894=1(119911

1119894 minus 119911

1) (119911

1119894 minus 119911

1)

1015840

+

1198992

sum

119894=1(119911

2119894 minus 119911

2) (119911

2119894 minus 119911

2)

1015840

]

(2)

where

119899 = 1198991 + 1198992 minus 2 (3)

To classify new observation 119911

0 we can utilize the follow-

ing FLDA allocation rule

(119911

1minus 119911

2)

1015840

119878minus1

119911

0 minus

12(119911

1minus 119911

2)

1015840

119878minus1

(119911

1+ 119911

2) ge 0

allocate 119911

0to group 1205871

(119911

1minus 119911

2)

1015840

119878minus1

119911

0 minus

12(119911

1minus 119911

2)

1015840

119878minus1

(119911

1+ 119911

2) lt 0

allocate 119911

0to group 1205872

(4)

22 Logistic Regression The microarray discriminationapproach with the use of logistic regression (LR) modelwas also studied for disease classification [22 25 26]

The structure of the logistic regression model can be brieflydescribed as follows Let

119875119894= Pr [119884

119894= 1 | 119909

1198941 1199091198942 119909119894119903] 119894 = 1 2 119899 (5)

be the conditional probability of event 119884119894= 1 under a given

series of independent variables (1199091198941 1199091198942 119909119894119903) The logistic

regression model then is defined as follows

ln(

119875119894

1 minus 119875119894

) = 1205730 +119903

sum

119895=1120573119895119909119894119895 (6)

Collinearity diagnosis procedure should be conductedfirst to exclude variables exhibiting high collinearity Aftercollinearity diagnosis the remaining variables are then usedfor logistic regressionmodeling and testing Afterward usinglogistic regression with Wald-forward method we can iden-tify significant independent variables say 119909

lowast

1198941 119909lowast

1198942 119909lowast

119894119896

and obtain a significance model

119894=

exp (1205730 + sum

119896

119895=1120573119895119909lowast

119894119895)

1 + exp (1205730 + sum

119896

119895=1120573119895119909lowast

119894119895)

119894 = 1 2 119899 (7)

23 Artificial Neural Network Based on gene expressionprofiles the artificial neural network (ANN) has also beenused to discriminate the tumor classes [22 27ndash29]The ANNframework includes the input the output and the hiddenlayersThe nodes in the input layer receive input signals froman external source and the nodes in the output layer providethe target output signals For each neuron 119895 in the hiddenlayer and neuron 119896 in the output layer the net inputs are givenby

net119895= sum

119894

119908119895119894times 119900119894

net119896= sum

119895

119908119896119895

times 119900119895

(8)

where 119894 (119895) is a neuron in the previous layer 119908119895119894

(119908119896119895) is the

connection weight from neuron 119894 (119895) to neuron 119895 (119896) and119900119894(119900119895) is the output of node 119894 (119895) The sigmoid functions are

given by

119900119894= net119894

119900119894=

11 + expminus(net119894+120579119894)

= 119891119894(net119894 120579119894)

119900119896=

11 + expminus(net119896+120579119896)

= 119891119896(net119896 120579119896)

(9)

where net119895(net119896) is the input signal from the external source

to the node 119895 (119896) in the input layer and 120579119895(120579119896) is a bias

The conventional technique used to derive the connectionweights of the feedforward network is the generalized deltarule [37]

Discrete Dynamics in Nature and Society 3

24 Support Vector Machine To classify tumor classes usingmicroarray data the discrimination method with the useof support vector machine (SVM) has also been discussed[22 30ndash33]The structure of SVMalgorithm can be describedas follows Let (119910

119894 119909

119894)119899

119894=1 119909119894

isin 119877119903 119910119894

isin minus1 1 be thetraining set with input vectors and labels where 119899 is thenumber of sample observations and 119903 is the dimension ofeach observation and 119910

119894is known target The algorithm is

to seek the hyperplane 119908

1015840sdot 119909

119894+ 119887 = 0 where 119908

is the vectorof hyperplane and 119887 is a bias term to separate the data fromtwo classes with maximal margin width 2119908

2 In order to

obtain the optimal hyperplane the SVMwas used to solve thefollowing optimization problem

Min Φ(119909

) =

121003817100381710038171003817119908

1003817100381710038171003817

2

st 119910119894(119908

1015840119909

119894+ 119887) ge 1 119894 = 1 2 119899

(10)

Because it is difficult to solve (10) SVM transformsthe optimization problem to be dual problem by Lagrangemethod The value of 120572 in the Lagrange method must benonnegative real coefficients Equation (10) is transformedinto the following constrained form [38]

Max Φ(119908 119887 120585 120572 120573)

=

119873

sum

119894=1120572119894minus

12

119873

sum

119894=1119895=1120572119894120572119895119910119894119910119895119909119879

119894119909119895

st119899

sum

119895=1120572119895119910119895= 0 0 le 120572

119894le 119862 119894 = 1 2 119899

(11)

In (11) 119862 is the penalty factor and determines the degreeof penalty assigned to an error Typically it could not findthe linear separate hyperplane for all application data Forproblems that can not be linearly separated in the input spaceSVM employs the kernel method to transform the originalinput space into a high dimensional feature space wherean optimal linear separating hyperplane can be found Thecommon kernel functions are linear polynomial radial basisfunction (RBF) and sigmoid Although several choices forthe kernel function are available the most widely used kernelfunction is the RBF which is defined as [39]

119870(119909119894 119909119895) = exp (minus120574

10038171003817100381710038171003817119909119894minus119909119895

10038171003817100381710038171003817

2) 120574 ge 0 (12)

where 120574 denotes the width of the RBF Consequently the RBFis used in this study and the multiclass SVM method is usedin this study [40]

25 Multivariate Adaptive Regression Splines The multi-variate adaptive regression splines (MARS) have also beenapplied for tumor classification using gene expression data[22 30] The general MARS function can be represented asfollows

119891 (119909) = 1198870 +

119872

sum

119898=1119887119898

119870119898

prod

119896=1[119878119896119898

(119909](119896119898) minus 119905119896119898

)] (13)

where 1198870and 119887

119898are the parameters 119872 is the number of

basis functions (BF) 119870119898is the number of knots 119878

119896119898takes

on values of either 1 or minus1 and indicates the right or leftsense of the associated step function ](119896119898) is the label ofthe independent variable and 119905

119896119898is the knot location The

optimal MARS model is chosen in a two-step procedureFirstly construct a large number of basis functions to fit thedata initially Secondly basis functions are deleted in orderof least contribution using the generalized cross-validation(GCV) criterion Tomeasure the importance of a variable wecan observe the decrease in the calculated GCV values whena variable is removed from the model The GCV is defined asfollows

LOF (119891119872) = GCV (119872) =

(1119899)sum119899119894=1 [119910119894 minus

119891119872

(119909119894)]

2

[1 minus 119862 (119872) 119899]2

(14)

where 119899 is the observations and 119862(119872) is the cost penaltymeasures of a model containing119872 basis function

3 The Proposed HybridDiscrimination Methods

The two-stage hybrid procedure is commonly used in variousfields such as financial distress warning system [41 42]medical area [43] statistical inference [44 45] and statisticalprocess control [36 46ndash48] To obtain the best accuracyfor a specific classification problem hybrid evolutionarycomputation algorithms are commonly used to optimize theresolution process [34ndash36] In this section several two-stagehybrid discrimination methods that integrate the frameworkof statistical-based approaches and computational intelli-gence methods are proposed for tumor classification basedon gene expression microarray data

The proposed methods include five components theFLDA the LR model the MARS model the ANN andthe SVM classifiers The proposed hybrid discriminationmethods combine the statistical-based discrimination meth-ods and computational intelligence methods In stage 1influencing variables are selected using LR orMARS In stage2 the selected important influencing variables are then takenas the input variables of FLDA LR ANN SVM or MARSThe following sections address the proposed approaches

31 Two-Stage Hybrid Method of LR and Various Classifiers

Stage 1 Substitute independent variables 1199091198941 1199091198942 119909119894119903 and

dependent variable 119910119894into logistic regression Apply logistic

regression with Wald-forward method to identify significantindependent variables say 119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896

Stage 2 Substitute the significant independent variables119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896obtained in Stage 1 and dependent variable

119910119894into various classifiers such as FLDA ANN SVM or

MARS The obtained corresponding hybrid methods arereferred to as the LR-FLDA LR-ANN LR-SVM and LR-MARS respectively

4 Discrete Dynamics in Nature and Society

Table 1 The influencing genes selected by using two-sample 119905-test with a significance level of 00001

Variables Gene description Mean SD 119875 value

1199091 CMKBR7 chemokine (C-C) receptor 7 AML 6812 14555 000005ALL minus4802 8269

1199092LAMP2 lysosome-associated membrane protein 2 alternativeproducts

AML 17160 11380 000001ALL 6221 8022

1199093 Quiescin (Q6) mRNA partial cds AML 153492 107094 000006ALL 71577 55919

1199094 Peptidyl-prolyl CIS-TRANS isomerase mitochondrial precursor AML 29944 27060 000006ALL 3464 11062

1199095 Transmembrane protein mRNA AML 9004 7282 000003ALL 1898 5912

1199096 PGD phosphogluconate dehydrogenase AML 97052 62100 000002ALL 48091 31341

1199097 Canalicular multispecific organic anion transporter (cMOAT) AML 4256 6305 000006ALL 13185 8413

1199098 Huntingtin interacting protein (HIP1) mRNA AML minus940 12107 000000ALL minus13609 12005

1199099ME491 gene extracted from H sapiens gene for Me491CD63antigen

AML 202680 165810 000001ALL 74770 54840

11990910GB DEF = nonmuscle myosin heavy chain-B (MYH10) mRNApartial cds

AML 24312 10991 000002ALL 48681 33069

11990911P4HB procollagen-proline 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase) beta polypeptide

AML 201560 138483 000003ALL 101583 50340

32 Two-Stage Hybrid Method of MARS andVarious Classifiers

Stage 1. Substitute the independent variables x_{i1}, x_{i2}, ..., x_{ir} and the dependent variable y_i into multivariate adaptive regression splines. Use multivariate adaptive regression splines to identify significant independent variables, say x*_{i1}, x*_{i2}, ..., x*_{ik}.

Stage 2. Substitute the significant independent variables x*_{i1}, x*_{i2}, ..., x*_{ik} obtained in Stage 1 and the dependent variable y_i into various classifiers such as FLDA, LR, ANN, or SVM. The corresponding hybrid methods are referred to as MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, respectively.
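A corresponding sketch of the MARS-based two-stage scheme, assuming the earth package for MARS and the e1071 package for the stage-2 SVM (the MARS-SVM variant), could look as follows; again, train, test, and the response y are illustrative names.

# Stage 1: MARS variable selection via the earth package.
library(earth)
library(e1071)
mars <- earth(y ~ ., data = train, glm = list(family = binomial))
keep <- rownames(evimp(mars))                    # variables retained by the MARS model
# Stage 2: use the MARS-selected variables as inputs of an SVM (MARS-SVM hybrid).
fit  <- svm(reformulate(keep, response = "y"), data = train,
            type = "C-classification", gamma = 0.5, cost = 2)
pred <- predict(fit, test)
mean(pred == test$y)                             # ACR on the test set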

4. The Cross-Validation Experiments

This study performs a series of cross-validation experiments to compare the proposed approaches with those previously discussed in the literature. This study considers a leukemia dataset that was first described by Golub et al. [5] and was examined in Dudoit et al. [10] and Lee et al. [22]. This dataset contains 6817 human genes and was obtained from Affymetrix high-density oligonucleotide microarrays. The data consist of 25 cases of acute myeloid leukemia (AML) and 47 cases of acute lymphoblastic leukemia (ALL).

Since the dimension of the data is very large (r = 6817) but there are only a few observations (n = 72), it is essential to reduce and refine the whole set of genes (independent variables) before we can construct the discrimination model.

Table 2: Collinearity diagnosis for LR modeling.

Variable | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11
VIF | 2.09 | 2.97 | 2.76 | 2.56 | 1.61 | 3.46 | 1.33 | 1.88 | 6.85 | 1.58 | 5.47

To refine the set of genes, Golub et al. [5], Dudoit et al. [10], and Lee et al. [22] proposed methods based on subjective ratios to select genes. The two-sample t-test is the most widely used test for differences in means between two groups. For the sake of rigor, instead of using a somewhat arbitrary criterion like those used in Golub et al. [5], Dudoit et al. [10], or Lee et al. [22], this study applies the two-sample t-test with a significance level of 0.0001 to select the influencing genes. The results are given in Table 1.
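The screening step can be sketched in R as follows, assuming expr is the 72 x 6817 expression matrix and class is the AML/ALL label vector (both names are illustrative).

# Per-gene two-sample t-tests; keep genes with p < 0.0001.
# Note: t.test() defaults to the Welch version; set var.equal = TRUE for the pooled test.
p_vals   <- apply(expr, 2, function(g) t.test(g ~ class)$p.value)
selected <- which(p_vals < 0.0001)               # the influencing genes (11 in Table 1)
length(selected)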

The significant variables selected using the two-sample t-test then serve as the input variables of the established single stage discrimination methods reviewed in Section 2 and the proposed two-stage hybrid methods introduced in Section 3. To examine the presence of collinearity, the variance inflation factor (VIF) was calculated. As shown in Table 2, all the VIF values are less than 10; consequently, there is no severe collinearity among these variables. In addition, this study adopts the suggestions of Dudoit et al. [10] and Lee et al. [22] and performs a 2:1 cross-validation (training set : test set).
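A brief sketch of the collinearity check and the 2:1 split, assuming the car package for vif() and a data frame dat holding y and the 11 selected genes (names are illustrative):

# Collinearity diagnosis: VIFs of the 11 selected genes (Table 2).
library(car)
vif(lm(as.numeric(y) ~ ., data = dat))           # all VIFs should stay below 10
# 2:1 split into training and test sets.
set.seed(1)
idx   <- sample(nrow(dat), size = round(2/3 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]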

The difficulty with an ANN is that design parameters such as the number of hidden layers and the number of neurons in each layer have to be set before the training process can proceed. The user has to select the ANN structure and set the values of certain parameters for the ANN modeling process.


> ## Find the best parameters: gamma & cost
> p <- seq(-1, 1, 1)
> obj <- tune.svm(y ~ ., data = train, tunecontrol = tune.control(sampling = "cross"),
+                 gamma = 2^(p), cost = 2^(p))
> obj

Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
  gamma cost
    0.5    2

> ## Building the SVM model
> svm.model <- svm(y ~ ., data = train, type = "C-classification",
+                  gamma = obj$best.parameters[[1]], cost = obj$best.parameters[[2]])
> ## Classification capability: Train
> svm.pred <- predict(svm.model, train)
> tab <- table(predict = svm.pred, true = train[, 1])
> tab
       true
predict  0  1
      0 17  0
      1  0 31
> cat("Accurate Classification Rate =", 100 * sum(diag(tab)) / sum(tab), "\n")
Accurate Classification Rate = 100
> ## Classification capability: Test
> svm.pred <- predict(svm.model, test)
> tab <- table(predict = svm.pred, true = test[, 1])
> tab
       true
predict  0  1
      0  2  1
      1  6 15
> cat("Accurate Classification Rate =", 100 * sum(diag(tab)) / sum(tab), "\n")
Accurate Classification Rate = 70.83333

Algorithm 1: The SVM modeling output.

However, there is no general and explicit approach to selecting optimal parameters for ANN models [49]. Accordingly, the selection of design parameters for an ANN may be based on a trial and error procedure.

This study employs the highest accurate classification rate (ACR) as the criterion for selecting the ANN topology. The topology is denoted as n_i-n_h-n_o-L, which stands for the number of neurons in the input layer, the number of neurons in the hidden layer, the number of neurons in the output layer, and the learning rate, respectively. Too few hidden nodes would limit the network's generalization capability, while too many hidden nodes may result in overtraining or memorization by the network. Since there are 11 input nodes and one output node used in this study, the numbers of hidden nodes tested were 9, 10, 11, 12, and 13, and the learning rates were chosen as 0.1, 0.01, or 0.001. After performing the ANN modeling, this study found that the 11-9-1-0.01 topology has the best ACR results.
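A minimal sketch of this trial-and-error search is given below, assuming the nnet package; nnet exposes a weight-decay parameter rather than an explicit learning rate, so decay is swept here as the analogous tuning knob.

# Grid search over hidden-layer sizes and decay values, keeping the topology with
# the highest ACR on the test set; the response y is assumed to be a factor.
library(nnet)
best <- list(acr = -Inf)
for (h in c(9, 10, 11, 12, 13)) {
  for (d in c(0.1, 0.01, 0.001)) {
    fit  <- nnet(y ~ ., data = train, size = h, decay = d, maxit = 500, trace = FALSE)
    pred <- predict(fit, test, type = "class")
    acr  <- 100 * mean(pred == test$y)
    if (acr > best$acr) best <- list(size = h, decay = d, acr = acr)
  }
}
best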

This study also applied SVM modeling to the microarray dataset. The two parameters C and γ are the most important factors affecting the performance of the SVM. The grid search method uses exponentially growing sequences of C and γ to determine good parameters, and the parameter set of C and γ that generates the highest ACR is considered

Table 3: The relative importance of the four explanatory variables for MARS modeling.

Function | Variable | Relative importance (%)
1 | x2 | 100.0
2 | x7 | 72.0
3 | x8 | 42.7
4 | x6 | 26.3

to be the ideal set. Here, the best parameter values for C and γ are 2 and 0.5, respectively. The SVM package was run on the dataset, and the corresponding output is displayed in Algorithm 1. As shown in Algorithm 1, with C = 2 and γ = 0.5 we obtain ACR = 100% for the initial training stage. In the testing stage, using the same parameter settings (i.e., C = 2 and γ = 0.5), we obtain ACR = 25% and ACR = 93.75% for AML and ALL, respectively (2 of the 8 AML and 15 of the 16 ALL test cases are classified correctly). Accordingly, the ACR is 70.83% for the case of the full sample.
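The class-wise rates quoted above follow directly from the test confusion matrix in Algorithm 1, as the short sketch below reproduces (class 0 corresponds to AML and class 1 to ALL in this data split).

# Test confusion matrix from Algorithm 1: rows = predicted, columns = true class.
tab <- matrix(c(2, 6, 1, 15), nrow = 2,
              dimnames = list(predict = c(0, 1), true = c(0, 1)))
100 * diag(tab) / colSums(tab)                   # 25.00 for AML (2/8) and 93.75 for ALL (15/16)
100 * sum(diag(tab)) / sum(tab)                  # 70.83 for the full test sample (17/24)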

For MARS modeling, the results are displayed in Table 3. During the selection process, four important explanatory variables were chosen, and the corresponding relative importance indicators are shown in Table 3. As a consequence,


Table 4: ACRs (%) for the thirteen approaches using cross-validation.

Method | AML | ALL | Full sample
Single stage
  FLDA | 37.50 | 93.75 | 75.00
  LR | 62.50 | 87.50 | 79.17
  ANN | 50.00 | 93.75 | 79.17
  SVM | 25.00 | 93.75 | 70.83
  MARS | 50.00 | 75.00 | 66.67
Two-stage
  LR-FLDA | 62.50 | 81.25 | 75.00
  LR-ANN | 50.00 | 93.75 | 79.17
  LR-SVM | 75.00 | 81.25 | 79.17
  LR-MARS | 62.50 | 93.75 | 83.33
  MARS-FLDA | 75.00 | 75.00 | 75.00
  MARS-LR | 75.00 | 75.00 | 75.00
  MARS-ANN | 37.50 | 93.75 | 75.00
  MARS-SVM | 62.50 | 87.50 | 79.17

those four important variables serve as the input variables for the hybrid modeling process. In addition, the resulting ACRs for each model are listed in Table 4.

The rationale behind the proposed hybrid discrimination methods is to obtain fewer but more informative variables by performing the first stage LR or MARS modeling. The selected significant variables then serve as the inputs for the second stage discrimination approach. In this study, the significant variables selected by the LR and MARS modeling are x1, x2, x7, and x8, and x2, x6, x7, and x8, respectively. For the hybrid LR-ANN model, the 4-6-1-0.01 topology provided the best ACR results; for the MARS-ANN hybrid model, the 4-6-1-0.01 topology also gave the best ACR results. Additionally, for both LR-SVM and MARS-SVM modeling, the best parameter values for C and γ are the same, namely 2 and 0.5, respectively.

For each of the thirteen different approaches, FLDA, LR, ANN, SVM, MARS, LR-FLDA, LR-ANN, LR-SVM, LR-MARS, MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, this study presents the corresponding ACRs in Table 4. Comparing the ACRs for AML, the LR has the highest ACR (i.e., 62.50%) among the 5 single stage methods, while both LR-SVM and MARS-LR have the highest ACR (i.e., 75.00%) among the 8 two-stage methods; apparently, the two-stage methods provide a better classification performance. Comparing the ACRs for ALL, the single stage methods FLDA, ANN, and SVM give the highest ACR (i.e., 93.75%), and the two-stage methods LR-ANN, LR-MARS, and MARS-ANN have the same ACR (i.e., 93.75%), so the single stage and two-stage methods achieve a similar performance there. As shown in Table 4, among the thirteen methods mentioned above, the two-stage hybrid model LR-MARS has the highest ACR (i.e., 83.33%) for the full sample. As a consequence, the proposed two-stage hybrid approaches are more efficient for tumor classification than the established single stage methods.

Table 5: Overall averaged ACR (%) and the associated standard error (in parentheses) for single stage and two-stage methods.

Method | AML | ALL | Full sample
Single stage | 45.00 (14.25) | 88.75 (8.15) | 74.17 (5.43)
Two-stage | 62.50 (13.36) | 85.16 (8.14) | 77.61 (3.10)

Figure 1: Improvement (%) of the proposed two-stage approach in comparison with the single stage method, shown for AML, ALL, and the full sample.

In addition, Table 5 lists the overall averaged ACRs and the associated standard errors (in parentheses) for the single stage and two-stage methods. Comparing the single stage and the proposed two-stage methods in Table 5, one can observe that the proposed methods provide more accurate results than the single stage methods in almost all cases. Although the single stage methods have a larger averaged ACR than the two-stage methods in classifying ALL, the difference is small. In addition, from Table 5 it can be found that the proposed two-stage approaches have smaller standard errors in all cases, which implies the robustness of the mechanisms. Figure 1 provides a comparison with respect to the overall improvement percentage over the single stage methods. From Figure 1, it can be seen that the two-stage approaches are more robust than the single stage methods.

5. Conclusions

This study proposes several two-stage hybrid discrimination approaches for tumor classification using microarray data. The proposed approaches integrate the framework of several frequently used statistical-based discrimination methods and computational intelligence classification techniques. Based on the results of cross-validation in Table 4, it can be easily observed that the proposed hybrid method LR-MARS is more appropriate for discriminating the tumor classes.

Computational intelligence methodology is very useful in many application areas and can deal with complex and computationally intensive problems. With the use of several computational intelligence techniques, this study develops two-stage hybrid discrimination approaches for tumor classification. The proposed hybrid model is not the only discrimination method that can be employed, and further research can build on this work. For example, one can combine other computational intelligence techniques, such as rough set theory [50] or the extreme learning machine, with neural networks or support vector machines to refine the structure further and improve the classification accuracy. Extensions of the proposed two-stage hybrid discrimination method to other statistical techniques or to multistage discrimination procedures are also possible. Such work deserves further research and is our future concern.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is partially supported by the Ministry of Science and Technology of China, Grant no. MOST 103-2118-M-030-001 and Grant no. MOST 103-2221-E-030-021.

References

[1] J. L. DeRisi, V. R. Iyer, and P. O. Brown, "Exploring the metabolic and genetic control of gene expression on a genomic scale," Science, vol. 278, no. 5338, pp. 680-686, 1997.
[2] R. J. Cho, M. J. Campbell, E. A. Winzeler et al., "A genome-wide transcriptional analysis of the mitotic cell cycle," Molecular Cell, vol. 2, no. 1, pp. 65-73, 1998.
[3] S. Chu, J. DeRisi, M. Eisen et al., "The transcriptional program of sporulation in budding yeast," Science, vol. 282, no. 5389, pp. 699-705, 1998.
[4] U. Alon, N. Barkai, D. A. Notterman et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745-6750, 1999.
[5] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[6] C. M. Perou, S. S. Jeffrey, M. van de Rijn et al., "Distinctive gene expression patterns in human mammary epithelial cells and breast cancers," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 16, pp. 9212-9217, 1999.
[7] J. R. Pollack, C. M. Perou, A. A. Alizadeh et al., "Genome-wide analysis of DNA copy-number changes using cDNA microarrays," Nature Genetics, vol. 23, no. 1, pp. 41-46, 1999.
[8] A. A. Alizadeh, M. B. Eisen, R. E. Davis et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000.
[9] S. Ramaswamy, P. Tamayo, R. Rifkin et al., "Multiclass cancer diagnosis using tumor gene expression signatures," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 26, pp. 15149-15154, 2001.
[10] S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, no. 457, pp. 77-87, 2002.
[11] J. J. Liu, G. Cutler, W. Li et al., "Multiclass cancer classification and biomarker discovery using GA-based algorithms," Bioinformatics, vol. 21, no. 11, pp. 2691-2697, 2005.
[12] L. Ziaei, A. R. Mehri, and M. Salehi, "Application of artificial neural networks in cancer classification and diagnosis prediction of a subtype of lymphoma based on gene expression profile," Journal of Research in Medical Sciences, vol. 11, no. 1, pp. 13-17, 2006.
[13] Z. Wang, Y. Wang, J. Xuan et al., "Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data," Bioinformatics, vol. 22, no. 6, pp. 755-761, 2006.
[14] K. V. G. Rao, P. P. Chand, and M. V. R. Murthy, "A neural network approach in medical decision systems," Journal of Theoretical and Applied Information Technology, vol. 3, pp. 97-101, 2007.
[15] H. Rattikorn and K. Phongphun, "Tumor classification ranking from microarray data," BMC Genomics, vol. 9, no. 2, article S21, 2008.
[16] L.-T. Huang, "An integrated method for cancer classification and rule extraction from microarray data," Journal of Biomedical Science, vol. 16, article 25, 10 pages, 2009.
[17] T. Y. Yang, "Efficient multi-class cancer diagnosis algorithm using a global similarity pattern," Computational Statistics & Data Analysis, vol. 53, no. 3, pp. 756-765, 2009.
[18] H. Pang, K. Ebisu, E. Watanabe, L. Y. Sue, and T. Tong, "Analysing breast cancer microarrays from African Americans using shrinkage-based discriminant analysis," Human Genomics, vol. 5, no. 1, pp. 5-16, 2010.
[19] N. B. Dawany, W. N. Dampier, and A. Tozeren, "Large-scale integration of microarray data reveals genes and pathways common to multiple cancer types," International Journal of Cancer, vol. 128, no. 12, pp. 2881-2891, 2011.
[20] R. Pillai, R. Deeter, C. T. Rigl et al., "Validation and reproducibility of a microarray-based gene expression test for tumor identification in formalin-fixed, paraffin-embedded specimens," The Journal of Molecular Diagnostics, vol. 13, no. 1, pp. 48-56, 2011.
[21] D. Ghosh, "Penalized discriminant methods for the classification of tumors from gene expression data," Biometrics, vol. 59, no. 4, pp. 992-1000, 2003.
[22] J. W. Lee, J. B. Lee, M. Park, and S. H. Song, "An extensive comparison of recent classification tools applied to microarray data," Computational Statistics & Data Analysis, vol. 48, no. 4, pp. 869-885, 2005.
[23] M. S. Srivastava and T. Kubokawa, "Comparison of discrimination methods for high dimensional data," Journal of the Japan Statistical Society, vol. 37, no. 1, pp. 123-134, 2007.
[24] R. Bermudo, D. Abia, A. Mozos et al., "Highly sensitive molecular diagnosis of prostate cancer using surplus material washed off from biopsy needles," British Journal of Cancer, vol. 105, no. 10, pp. 1600-1607, 2011.
[25] W. Li, F. Sun, and I. Grosse, "Extreme value distribution based gene selection criteria for discriminant microarray data analysis using logistic regression," Journal of Computational Biology, vol. 11, no. 2-3, pp. 215-226, 2004.
[26] J. G. Liao and K. V. Chin, "Logistic regression for disease classification using microarray data: model selection in a large p and small n case," Bioinformatics, vol. 23, no. 15, pp. 1945-1951, 2007.
[27] S. Gruvberger, M. Ringner, Y. Chen et al., "Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns," Cancer Research, vol. 61, no. 16, pp. 5979-5984, 2001.
[28] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673-679, 2001.
[29] L. J. Lancashire, C. Lemetre, and G. R. Ball, "An introduction to artificial neural networks in bioinformatics: application to complex microarray and mass spectrometry datasets in cancer studies," Briefings in Bioinformatics, vol. 10, no. 3, pp. 315-329, 2009.
[30] M. R. Segal, K. D. Dahlquist, and B. R. Conklin, "Regression approaches for microarray data analysis," Journal of Computational Biology, vol. 10, no. 6, pp. 961-980, 2003.
[31] A. Dragomir and A. Bezerianos, "Improving gene expression sample classification using support vector machine ensembles aggregated by boosting," Cancer Genomics & Proteomics, vol. 3, no. 1, pp. 63-70, 2006.
[32] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, "Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485-494, 2007.
[33] X. Wang and R. Simon, "Microarray-based cancer prediction using single genes," BMC Bioinformatics, vol. 12, article 391, 2011.
[34] K. Y. Chan, C. K. Kwong, and Y. C. Tsim, "Modelling and optimization of fluid dispensing for electronic packaging using neural fuzzy networks and genetic algorithms," Engineering Applications of Artificial Intelligence, vol. 23, no. 1, pp. 18-26, 2010.
[35] K. Y. Chan, T. S. Dillon, and C. K. Kwong, "Modeling of a liquid epoxy molding process using a particle swarm optimization-based fuzzy regression approach," IEEE Transactions on Industrial Informatics, vol. 7, no. 1, pp. 148-158, 2011.
[36] Y. E. Shao and C.-D. Hou, "Change point determination for a multivariate process using a two-stage hybrid scheme," Applied Soft Computing, vol. 13, no. 3, pp. 1520-1527, 2013.
[37] D. E. Rumelhart and J. L. McClelland, Explorations in the Microstructure of Cognition, vol. 1, MIT Press, 1986.
[38] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.
[39] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113-126, 2004.
[40] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.
[41] S. L. Lin, C. D. Hou, and P. H. Gi, "Do the two-stage hybrid models outperform the conventional techniques? Evidence in Taiwan," International Journal of Business and Strategy, vol. 9, pp. 98-131, 2008.
[42] S. L. Lin, "A new two-stage hybrid approach of credit risk in banking industry," Expert Systems with Applications, vol. 36, no. 4, pp. 8333-8341, 2009.
[43] S.-M. Chou, T.-S. Lee, Y. E. Shao, and I.-F. Chen, "Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines," Expert Systems with Applications, vol. 27, no. 1, pp. 133-142, 2004.
[44] R. Modarres and J. L. Gastwirth, "Hybrid test for the hypothesis of symmetry," Journal of Applied Statistics, vol. 25, no. 6, pp. 777-783, 1998.
[45] R. Tang, M. Banerjee, and G. Michailidis, "A two-stage hybrid procedure for estimating an inverse regression function," The Annals of Statistics, vol. 39, no. 2, pp. 956-989, 2011.
[46] W. Bischoff and F. Miller, "A minimax two-stage procedure for comparing treatments looking at a hybrid test and estimation problem as a whole," Statistica Sinica, vol. 12, no. 4, pp. 1133-1144, 2002.
[47] C.-J. Lu, Y. E. Shao, and P.-H. Li, "Mixture control chart patterns recognition using independent component analysis and support vector machine," Neurocomputing, vol. 74, no. 11, pp. 1908-1914, 2011.
[48] Y. E. Shao and C. D. Hou, "Fault identification in industrial processes using an integrated approach of neural network and analysis of variance," Mathematical Problems in Engineering, vol. 2013, Article ID 516760, 7 pages, 2013.
[49] Z. W. Zhong, L. P. Khoo, and S. T. Han, "Prediction of surface roughness of turned surfaces using neural networks," International Journal of Advanced Manufacturing Technology, vol. 28, no. 7-8, pp. 688-693, 2006.
[50] Y. E. Shao, C.-D. Hou, and C.-C. Chiu, "Hybrid intelligent modeling schemes for heart disease classification," Applied Soft Computing Journal, vol. 14, pp. 47-52, 2014.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article Integrated Use of Statistical-Based ...downloads.hindawi.com/journals/ddns/2015/261013.pdfIntegrated Use of Statistical-Based Approaches and Computational Intelligence

Discrete Dynamics in Nature and Society 3

24 Support Vector Machine To classify tumor classes usingmicroarray data the discrimination method with the useof support vector machine (SVM) has also been discussed[22 30ndash33]The structure of SVMalgorithm can be describedas follows Let (119910

119894 119909

119894)119899

119894=1 119909119894

isin 119877119903 119910119894

isin minus1 1 be thetraining set with input vectors and labels where 119899 is thenumber of sample observations and 119903 is the dimension ofeach observation and 119910

119894is known target The algorithm is

to seek the hyperplane 119908

1015840sdot 119909

119894+ 119887 = 0 where 119908

is the vectorof hyperplane and 119887 is a bias term to separate the data fromtwo classes with maximal margin width 2119908

2 In order to

obtain the optimal hyperplane the SVMwas used to solve thefollowing optimization problem

Min Φ(119909

) =

121003817100381710038171003817119908

1003817100381710038171003817

2

st 119910119894(119908

1015840119909

119894+ 119887) ge 1 119894 = 1 2 119899

(10)

Because it is difficult to solve (10) SVM transformsthe optimization problem to be dual problem by Lagrangemethod The value of 120572 in the Lagrange method must benonnegative real coefficients Equation (10) is transformedinto the following constrained form [38]

Max Φ(119908 119887 120585 120572 120573)

=

119873

sum

119894=1120572119894minus

12

119873

sum

119894=1119895=1120572119894120572119895119910119894119910119895119909119879

119894119909119895

st119899

sum

119895=1120572119895119910119895= 0 0 le 120572

119894le 119862 119894 = 1 2 119899

(11)

In (11) 119862 is the penalty factor and determines the degreeof penalty assigned to an error Typically it could not findthe linear separate hyperplane for all application data Forproblems that can not be linearly separated in the input spaceSVM employs the kernel method to transform the originalinput space into a high dimensional feature space wherean optimal linear separating hyperplane can be found Thecommon kernel functions are linear polynomial radial basisfunction (RBF) and sigmoid Although several choices forthe kernel function are available the most widely used kernelfunction is the RBF which is defined as [39]

119870(119909119894 119909119895) = exp (minus120574

10038171003817100381710038171003817119909119894minus119909119895

10038171003817100381710038171003817

2) 120574 ge 0 (12)

where 120574 denotes the width of the RBF Consequently the RBFis used in this study and the multiclass SVM method is usedin this study [40]

25 Multivariate Adaptive Regression Splines The multi-variate adaptive regression splines (MARS) have also beenapplied for tumor classification using gene expression data[22 30] The general MARS function can be represented asfollows

119891 (119909) = 1198870 +

119872

sum

119898=1119887119898

119870119898

prod

119896=1[119878119896119898

(119909](119896119898) minus 119905119896119898

)] (13)

where 1198870and 119887

119898are the parameters 119872 is the number of

basis functions (BF) 119870119898is the number of knots 119878

119896119898takes

on values of either 1 or minus1 and indicates the right or leftsense of the associated step function ](119896119898) is the label ofthe independent variable and 119905

119896119898is the knot location The

optimal MARS model is chosen in a two-step procedureFirstly construct a large number of basis functions to fit thedata initially Secondly basis functions are deleted in orderof least contribution using the generalized cross-validation(GCV) criterion Tomeasure the importance of a variable wecan observe the decrease in the calculated GCV values whena variable is removed from the model The GCV is defined asfollows

LOF (119891119872) = GCV (119872) =

(1119899)sum119899119894=1 [119910119894 minus

119891119872

(119909119894)]

2

[1 minus 119862 (119872) 119899]2

(14)

where 119899 is the observations and 119862(119872) is the cost penaltymeasures of a model containing119872 basis function

3 The Proposed HybridDiscrimination Methods

The two-stage hybrid procedure is commonly used in variousfields such as financial distress warning system [41 42]medical area [43] statistical inference [44 45] and statisticalprocess control [36 46ndash48] To obtain the best accuracyfor a specific classification problem hybrid evolutionarycomputation algorithms are commonly used to optimize theresolution process [34ndash36] In this section several two-stagehybrid discrimination methods that integrate the frameworkof statistical-based approaches and computational intelli-gence methods are proposed for tumor classification basedon gene expression microarray data

The proposed methods include five components theFLDA the LR model the MARS model the ANN andthe SVM classifiers The proposed hybrid discriminationmethods combine the statistical-based discrimination meth-ods and computational intelligence methods In stage 1influencing variables are selected using LR orMARS In stage2 the selected important influencing variables are then takenas the input variables of FLDA LR ANN SVM or MARSThe following sections address the proposed approaches

31 Two-Stage Hybrid Method of LR and Various Classifiers

Stage 1 Substitute independent variables 1199091198941 1199091198942 119909119894119903 and

dependent variable 119910119894into logistic regression Apply logistic

regression with Wald-forward method to identify significantindependent variables say 119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896

Stage 2 Substitute the significant independent variables119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896obtained in Stage 1 and dependent variable

119910119894into various classifiers such as FLDA ANN SVM or

MARS The obtained corresponding hybrid methods arereferred to as the LR-FLDA LR-ANN LR-SVM and LR-MARS respectively

4 Discrete Dynamics in Nature and Society

Table 1 The influencing genes selected by using two-sample 119905-test with a significance level of 00001

Variables Gene description Mean SD 119875 value

1199091 CMKBR7 chemokine (C-C) receptor 7 AML 6812 14555 000005ALL minus4802 8269

1199092LAMP2 lysosome-associated membrane protein 2 alternativeproducts

AML 17160 11380 000001ALL 6221 8022

1199093 Quiescin (Q6) mRNA partial cds AML 153492 107094 000006ALL 71577 55919

1199094 Peptidyl-prolyl CIS-TRANS isomerase mitochondrial precursor AML 29944 27060 000006ALL 3464 11062

1199095 Transmembrane protein mRNA AML 9004 7282 000003ALL 1898 5912

1199096 PGD phosphogluconate dehydrogenase AML 97052 62100 000002ALL 48091 31341

1199097 Canalicular multispecific organic anion transporter (cMOAT) AML 4256 6305 000006ALL 13185 8413

1199098 Huntingtin interacting protein (HIP1) mRNA AML minus940 12107 000000ALL minus13609 12005

1199099ME491 gene extracted from H sapiens gene for Me491CD63antigen

AML 202680 165810 000001ALL 74770 54840

11990910GB DEF = nonmuscle myosin heavy chain-B (MYH10) mRNApartial cds

AML 24312 10991 000002ALL 48681 33069

11990911P4HB procollagen-proline 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase) beta polypeptide

AML 201560 138483 000003ALL 101583 50340

32 Two-Stage Hybrid Method of MARS andVarious Classifiers

Stage 1 Substitute independent variables 1199091198941 1199091198942 119909119894119903 and

dependent variable 119910119894into multivariate adaptive regression

splines Use multivariate adaptive regression splines to iden-tify significant independent variables say 119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896

Stage 2 Substitute the significant independent variables119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896obtained in Stage 1 and dependent variable

119910119894into various classifiers such as FLDA LR ANN or SVM

The corresponding hybrid methods are referred to as theMARS-FLDA MARS-LR MARS-ANN and MARS-SVMrespectively

4 The Cross-Validation Experiments

This study performs a series of cross-validation experimentsto compare the proposed approaches with those previouslydiscussed in literature This study considers a leukemiadataset that was first described by Golub et al [5] and wasexamined in Dudoit et al [10] and Lee et al [22] Thisdataset contains 6817 human genes and was obtained fromAffymetrix high-density oligonucleotide microarrays Thedata consist of 25 cases of acutemyeloid leukemia (AML) and47 cases of acute lymphoblastic leukemia (ALL)

Since the dimension of the data is very large (119903 = 6817)but there are only a few observations (119899 = 72) it is essentialto reduce and refine the whole set of genes (independentvariables) before we can construct the discrimination model

Table 2 Collinearity diagnosis for LR modeling

Variables 1199091 1199092 1199093 1199094 1199095 1199096 1199097 1199098 1199099 11990910 11990911

VIF 209 297 276 256 161 346 133 188 685 158 547

To refine the set of genes Golub et al [5] Dudoit et al [10]and Lee et al [22] proposed the methods of subjective ratiosto select genes It is well known that the two-sample 119905-test isthe most popular test to test for the differences between twogroups in means For the sake of strictness instead of usinga somewhat arbitrary criterion like that used in Golub et al[5] Dudoit et al [10] or Lee et al [22] this study applies thetwo-sample 119905-test with a significance level of 00001 to selectthe influencing genes The results are given in Table 1

The significant variables selected using two-sample 119905-testare then served as the input variables of the established singlestage discrimination methods reviewed in Section 2 and theproposed two-stage hybrid methods introduced in Section 3To examine the presence of collinearity the variance inflationfactor (VIF) was calculated As shown in Table 2 all thevalues of VIFs are less than 10 Consequently there is no highcollinearity among these variables In addition this studyadopts the suggestions of Dudoit et al [10] and Lee et al [22]and performs a 2 1 cross-validation (training set test set)

The difficulty with ANN is that the design parameterssuch as the number of hidden layers and the number ofneurons in each layer have to be set before training processcan proceed User has to select the ANN structure and set thevalues of certain parameters for the ANN modeling process

Discrete Dynamics in Nature and Society 5

gt Find the best parameter gammaampcostgt plt-seq(-111)gt objlt-tunesvm(ysim data=train sampling=cross gamma=2and(p) cost=2and(p))gt objParameter tuning of lsquosvmrsquo- sampling method 10-fold cross validation- best parametersgamma cost05 2

gt Building the SVMmodelgt svmmodellt-svm(ysim data=train type=C-classification gamma=obj$bestparameters[[1]] cost=obj$bestparameters[[2]])gt Classification capability Traingt svmpredlt-predict(svmmodel train)gt tablt-table(predict=svmpred true=train[1])gt tab

truepredict 0 1

0 17 01 0 31

gt cat(Accurate Classification Rate = 100lowastsum(dig(tab))sum(tab) n)Accurate Classification Rate = 100 gt Classification capability Testgt svmpredlt-predict(svmmodel test)gt tablt-table(predict=svmpred true=test[1])gt tab

truepredict 0 1

0 2 11 6 15

gt cat(Accurate Classification Rate = 100lowastsum(dig(tab))sum(tab) n)Accurate Classification Rate = 7083333

Algorithm 1 The SVMmodeling output

However there is no general and explicit approach to selectoptimal parameters for the ANN models [49] Accordinglythe selection of design parameters for ANNmay be based onthe trial and error procedure

This study employs the highest accurate classificationrate (ACR) as the criterion for selecting the ANN topologyThe topology is defined as 119899

119894-119899ℎ-119899119900-119871 where it stands

for the number of neurons in the input layer number ofneurons in the hidden layer number of neurons in theoutput layer and learning rate respectively Actually too fewhidden nodes would limit the network generation capabilitywhile too many hidden nodes may result in overtraining ormemorization by the network Since there are 11 input nodesand one output node used in this study the numbers ofhidden nodes to test were selected as 9 10 11 12 and 13 Thelearning rates are chosen as 01 001 or 0001 respectivelyAfter performing the ANN modeling this study found thatthe 11-9-1-001 topology has the best ACR results

This study also performed the SVM modeling to themicroarray dataset The two parameters 119862 and 120574 are themost important factors to affect the performance of SVMThegrid search method uses exponentially growing sequences of119862 and 120574 to determine good parameters The parameter setof 119862 and 120574 which generates the highest ACR is considered

Table 3 The relative importance of four explanatory variables forMARS modelling

Function Variable Relative importance ()1 119909

21000

2 1199097

7203 119909

8427

4 1199096

263

to be ideal set Here the best two parameter values for 119862

and 120574 are 2 and 05 respectively The SVM package wasperformed in running the dataset and the correspondingoutput is displayed in Algorithm 1 Observing Algorithm 1 inthe case of 119862 = 2 and 120574 = 05 we can have ACR = 100 forthe initial training stage Consequently in the testing stagewe are able to obtain ACR = 25 and ACR = 9375 for AMLand ALL respectively by using the same parameter settings(ie 119862 = 2 and 120574 = 05) Accordingly the ACR = 7083 forthe case of full sample

For MARS modeling the results are displayed in Table 3During the selection process four important explanatoryvariables were chosen The corresponding relative impor-tance indicators are showed in Table 3 As a consequence

6 Discrete Dynamics in Nature and Society

Table 4 ACRs for thirteen approaches using cross-validation

Method ACRAML ALL Full sample

Single stageFLDA 3750 9375 7500LR 6250 8750 7917ANN 5000 9375 7917SVM 2500 9375 7083MARS 5000 7500 6667

Two-stageLR-FLDA 6250 8125 7500LR-ANN 5000 9375 7917LR-SVM 7500 8125 7917LR-MARS 6250 9375 8333MARS-FLDA 7500 7500 7500MARS-LR 7500 7500 7500MARS-ANN 3750 9375 7500MARS-SVM 6250 8750 7917

those four important variables would be served as the inputvariables for hybridmodeling process In addition the resultsof ACR for each modeling are listed in Table 4

The rationale behind the proposed hybrid discriminationmethod is to obtain the fewer but more informative variablesby performing the first stage LR or MARS modeling Theselected significant variables are then served as the inputs forthe second stage of discrimination approach In this study thesignificant variables selected by performing LR and MARSmodeling are variables 1199091 1199092 1199097 and 1199098 and variables 1199092 11990961199097 and 1199098 respectively For the hybrid LR-ANN model the4-6-1-001 topology provided the best ACR results For theMARS-ANN hybrid model the 4-6-1-001 topology alsogave the best ACR results Additionally for both LR-SVMandMARS-SVM modeling the best two parameter values for 119862

and 120574 are the same and they are 2 and 05 respectivelyFor each of the thirteen different approaches FLDA

LR ANN SVM MARS LR-FLDA LR-ANN LR-SVM LR-MARS MARS-FLDA MARS-LR MARS-ANN and MARS-SVM this study presents the corresponding ACRs in Table 4By comparing the ACR with AML while the LR has highestACR (ie 6250) among the 5 single stage methods bothLR-SVM and MARS-LR have the highest ACR (ie 7500)among the 8 two-stage methods Apparently the two-stagemethods provide a better classification performance Bycomparing the ACR with ALL the single stage methods ofFLDA ANN and SVM give the highest ACR (ie 9375)and the two-stage methods of LR-ANN LR-MARS andMARS-ANN have the same ACR (ie 9375) It seems thatthe single stage and two-stage methods achieve a similarperformance As shown in Table 4 it can be seen that amongthe thirteen methods mentioned above the two-stage hybridmodel of LR-MARShas the highestACRs (ie 8333) for thefull sample As a consequence the proposed two-stage hybridapproaches aremore efficient for tumor classification than theestablished single stage methods

Table 5 Overall averaged ACR and the associated standard error(in parentheses) for single stage and two-stage methods

MethodACR

AML ALL Full sample

Single stage 4500(1425)

8875(815)

7417(543)

Two-stage 6250(1336)

8516(814)

7761(310)

012625

4291

05

101520253035404550

ALL FullAMLIm

prov

emen

t (

)

Figure 1 Improvement of the proposed approach in comparisonwith the single stage method

In addition Table 5 lists the overall averaged ACRs andthe associated standard errors (in parentheses) for singlestage and two-stage methods In comparison to the singlestage and the proposed two-stage methods in Table 5one is able to observe that our proposed methods almostprovide more accurate results than the single stage methodsAlthough the single stage methods have larger averagedACR value than two-stage methods in classifying ALL thedifference is not too significant In addition observing Table 5it can be found that the proposed two-stage approaches havethe smaller standard errors for all the cases which imply therobustness of the mechanisms Figure 1 provides a compar-ison with respect to the overall improvement percentage inthe single stage method From Figure 1 it can be seen that thetwo-stage approaches are more robust than the single stagemethod

5 Conclusions

This study proposes several two-stage hybrid discriminationapproaches for tumor classification using microarray dataThe proposed approaches integrate the framework of severalfrequently used statistical-based discriminationmethods andcomputational intelligence classifying techniques Based onthe results of cross-validation in Table 4 it can be easilyobserved that the proposed hybridmethodLR-MARS ismoreappropriate for discriminating the tumor classes

Computational intelligencemethodology is very useful inmany aspects of application and can deal with complex andcomputationally intensive problems With the use of severalcomputational intelligence techniques this study develops

Discrete Dynamics in Nature and Society 7

two-stage hybrid discrimination approach for tumor classi-ficationThe proposed hybrid model is not the only discrimi-nationmethod that can be employed Based on our work fur-ther research can be expanded For example one can combineother computational intelligence techniques such as roughset theory [50] or extreme learning machine with neuralnetworks or support vector machine to refine the structurefurther and improve the classification accuracy Extensionsof the proposed two-stage hybrid discrimination method toother statistical techniques or to multistage discriminationprocedures are also possible Such works deserve furtherresearch and are our future concern

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work is partially supported by the Ministry of Scienceand Technology of China Grant no MOST 103-2118-M-030-001 and Grant no MOST 103-2221-E-030-021

References

[1] J LDeRisi V R Iyer andPO Brown ldquoExploring themetabolicand genetic control of gene expression on a genomic scalerdquoScience vol 278 no 5338 pp 680ndash686 1997

[2] R J Cho M J Campbell E AWinzeler et al ldquoA genome-widetranscriptional analysis of the mitotic cell cyclerdquoMolecular Cellvol 2 no 1 pp 65ndash73 1998

[3] S Chu J DeRisi M Eisen et al ldquoThe transcriptional programof sporulation in budding yeastrdquo Science vol 282 no 5389 pp699ndash705 1998

[4] U Alon N Barka D A Notterman et al ldquoBroad patternsof gene expression revealed by clustering analysis of tumorand normal colon tissues probed by oligonucleotide arraysrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 96 no 12 pp 6745ndash6750 1999

[5] T R Golub D K Slonim P Tamayo et al ldquoMolecularclassification of cancer class discovery and class prediction bygene expressionmonitoringrdquo Science vol 286 no 5439 pp 531ndash527 1999

[6] C M Perou S S Jeffrey M van de Rijn et al ldquoDistinctive geneexpression patterns in human mammary epithelial cells andbreast cancersrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 96 no 16 pp 9212ndash92171999

[7] J R Pollack C M Perou A A Alizadeh et al ldquoGenome-wide analysis of DNA copy-number changes using cDNAmicroarraysrdquo Nature Genetics vol 23 no 1 pp 41ndash46 1999

[8] A A Alizadeh M B Elsen R E Davis et al ldquoDistinct typesof diffuse large B-cell lymphoma identified by gene expressionprofilingrdquo Nature vol 403 no 6769 pp 503ndash511 2000

[9] S Ramaswamy P Tamayo R Rifkin et al ldquoMulticlass cancerdiagnosis using tumor gene expression signaturesrdquo Proceedingsof the National Academy of Sciences of the United States ofAmerica vol 98 no 26 pp 15149ndash15154 2001

[10] S Dudoit J Fridlyand and T P Speed ldquoComparison of dis-criminationmethods for the classification of tumors using geneexpression datardquo Journal of the American Statistical Associationvol 97 no 457 pp 77ndash87 2002

[11] J J Liu G Cutler W Li et al ldquoMulticlass cancer classificationand biomarker discovery using GA-based algorithmsrdquo Bioinfor-matics vol 21 no 11 pp 2691ndash2697 2005

[12] L Ziaei A R Mehri and M Salehi ldquoApplication of artificialneural networks in cancer classification and diagnosis pre-diction of a subtype of lymphoma based on gene expressionprofilerdquo Journal of Research in Medical Sciences vol 11 no 1 pp13ndash17 2006

[13] Z Wang Y Wang J Xuan et al ldquoOptimized multilayerperceptrons for molecular classification and diagnosis usinggenomic datardquo Bioinformatics vol 22 no 6 pp 755ndash761 2006

[14] K V G Rao P P Chand and M V R Murthy ldquoA neuralnetwork approach in medical decision systemsrdquo Journal ofTheoretical and Applied Information Technology vol 3 pp 97ndash101 2007

[15] H Rattikorn and K Phongphun ldquoTumor classification rankingfrom microarray datardquo BMC Genomics vol 9 no 2 article S212008

[16] L-T Huang ldquoAn integrated method for cancer classificationand rule extraction frommicroarray datardquo Journal of BiomedicalScience vol 16 article 25 10 pages 2009

[17] T Y Yang ldquoEfficient multi-class cancer diagnosis algorithmusing a global similarity patternrdquo Computational Statistics ampData Analysis vol 53 no 3 pp 756ndash765 2009

[18] H Pang K Ebisu E Watanabe L Y Sue and T TongldquoAnalysing breast cancer microarrays from African Amer-icans using shrinkage-based discriminant analysisrdquo HumanGenomics vol 5 no 1 pp 5ndash16 2010

[19] N B Dawany W N Dampier and A Tozeren ldquoLarge-scaleintegration of microarray data reveals genes and pathwayscommon to multiple cancer typesrdquo International Journal ofCancer vol 128 no 12 pp 2881ndash2891 2011

[20] R Pillai RDeeter C T Rigl et al ldquoValidation and reproducibil-ity of a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensrdquo TheJournal of Molecular Diagnostics vol 13 no 1 pp 48ndash56 2011

[21] D Ghosh ldquoPenalized discriminant methods for the classifica-tion of tumors from gene expression datardquo Biometrics vol 59no 4 pp 992ndash1000 2003

[22] J W Lee J B Lee M Park and S H Song ldquoAn extensivecomparison of recent classification tools applied to microarraydatardquo Computational Statistics amp Data Analysis vol 48 no 4pp 869ndash885 2005

[23] M S Srivastava and T Kubokawa ldquoComparison of discrimina-tion methods for high dimensional datardquo Journal of the JapanStatistical Society vol 37 no 1 pp 123ndash134 2007

[24] R Bermudo D Abia A Mozos et al ldquoHighly sensitivemolecular diagnosis of prostate cancer using surplus materialwashed off from biopsy needlesrdquo British Journal of Cancer vol105 no 10 pp 1600ndash1607 2011

[25] W Li F Sun and I Grosse ldquoExtreme value distribution basedgene selection criteria for discriminantmicroarray data analysisusing logistic regressionrdquo Journal of Computational Biology vol11 no 2-3 pp 215ndash226 2004

[26] J G Liao and K V Chin ldquoLogistic regression for diseaseclassification using microarray data model selection in a largep and small n caserdquo Bioinformatics vol 23 no 15 pp 1945ndash19512007

8 Discrete Dynamics in Nature and Society

[27] S Gruvberger M Ringner Y Chen et al ldquoEstrogen receptorstatus in breast cancer is associated with remarkably distinctgene expression patternsrdquo Cancer Research vol 61 no 16 pp5979ndash5984 2001

[28] J Khan J S Wei M Ringner et al ldquoClassification anddiagnostic prediction of cancers using gene expression profilingand artificial neural networksrdquoNatureMedicine vol 7 no 6 pp673ndash679 2001

[29] L J Lancashire C Lemetre and G R Ball ldquoAn introductionto artificial neural networks in bioinformaticsmdashapplication tocomplex microarray and mass spectrometry datasets in cancerstudiesrdquo Briefings in Bioinformatics vol 10 no 3 pp 315ndash3292009

[30] M R Segal K D Dahlquist and B R Conklin ldquoRegressionapproaches for microarray data analysisrdquo Journal of Computa-tional Biology vol 10 no 6 pp 961ndash980 2003

[31] A Dragomir and A Bezerianos ldquoImproving gene expressionsample classification using support vector machine ensemblesaggregated by boostingrdquo Cancer Genomics amp Proteomics vol 3no 1 pp 63ndash70 2006

[32] R Zhang G-B Huang N Sundararajan and P Saratchan-dran ldquoMulticategory classification using an extreme learningmachine for microarray gene expression cancer diagnosisrdquoIEEEACM Transactions on Computational Biology and Bioin-formatics vol 4 no 3 pp 485ndash494 2007

[33] X Wang and R Simon ldquoMicroarray-based cancer predictionusing single genesrdquoBMCBioinformatics vol 12 article 391 2011

[34] K Y Chan C K Kwong and Y C Tsim ldquoModelling andoptimization of fluid dispensing for electronic packaging usingneural fuzzy networks and genetic algorithmsrdquo EngineeringApplications of Artificial Intelligence vol 23 no 1 pp 18ndash262010

[35] K Y Chan T S Dillon and C K Kwong ldquoModeling of a liquidepoxy molding process using a particle swarm optimization-based fuzzy regression approachrdquo IEEE Transactions on Indus-trial Informatics vol 7 no 1 pp 148ndash158 2011

[36] Y E Shao and C-D Hou ldquoChange point determination for amultivariate process using a two-stage hybrid schemerdquo AppliedSoft Computing vol 13 no 3 pp 1520ndash1527 2013

[37] D E Rumelhart and J L McClelland Explorations in theMicrostructure of Cognition 1 MIT Press 1986

[38] VN VapnikTheNature of Statistical LearningTheory SpringerBerlin Germany 2000

[39] V Cherkassky and Y Ma ldquoPractical selection of SVM parame-ters and noise estimation for SVM regressionrdquoNeural Networksvol 17 no 1 pp 113ndash126 2004

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] S L Lin C D Hou and P H Gi ldquoDo the two-stage hybridmodels outperform the conventional techniques Evidence inTaiwanrdquo International Journal of Business and Strategy vol 9pp 98ndash131 2008

[42] S L Lin ldquoA new two-stage hybrid approach of credit risk inbanking industryrdquo Expert Systems with Applications vol 36 no4 pp 8333ndash8341 2009

[43] S-M Chou T-S Lee Y E Shao and I-F Chen ldquoMiningthe breast cancer pattern using artificial neural networks andmultivariate adaptive regression splinesrdquo Expert Systems withApplications vol 27 no 1 pp 133ndash142 2004

[44] R Modarres and J L Gastwirth ldquoHybrid test for the hypothesisof symmetryrdquo Journal of Applied Statistics vol 25 no 6 pp 777ndash783 1998

[45] R Tang M Banerjee and G Michailidis ldquoA two-stage hybridprocedure for estimating an inverse regression functionrdquo TheAnnals of Statistics vol 39 no 2 pp 956ndash989 2011

[46] W Bischoff and F Miller ldquoA minimax two-stage procedure forcomparing treatments looking at a hybrid test and estimationproblemas awholerdquo Statistica Sinica vol 12 no 4 pp 1133ndash11442002

[47] C-J Lu Y E Shao and P-H Li ldquoMixture control chartpatterns recognition using independent component analysisand support vector machinerdquo Neurocomputing vol 74 no 11pp 1908ndash1914 2011

[48] Y E Shao and C D Hou ldquoFault identification in industrialprocesses using an integrated approach of neural network andanalysis of variancerdquoMathematical Problems in Engineering vol2013 Article ID 516760 7 pages 2013

[49] Z W Zhong L P Khoo and S T Han ldquoPrediction ofsurface roughness of turned surfaces using neural networksrdquoInternational Journal of Advanced Manufacturing Technologyvol 28 no 7-8 pp 688ndash693 2006

[50] Y E Shao C-D Hou and C-C Chiu ldquoHybrid intelligentmodeling schemes for heart disease classificationrdquo Applied SoftComputing Journal vol 14 pp 47ndash52 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: Research Article Integrated Use of Statistical-Based ...downloads.hindawi.com/journals/ddns/2015/261013.pdfIntegrated Use of Statistical-Based Approaches and Computational Intelligence

4 Discrete Dynamics in Nature and Society

Table 1 The influencing genes selected by using two-sample 119905-test with a significance level of 00001

Variables Gene description Mean SD 119875 value

1199091 CMKBR7 chemokine (C-C) receptor 7 AML 6812 14555 000005ALL minus4802 8269

1199092LAMP2 lysosome-associated membrane protein 2 alternativeproducts

AML 17160 11380 000001ALL 6221 8022

1199093 Quiescin (Q6) mRNA partial cds AML 153492 107094 000006ALL 71577 55919

1199094 Peptidyl-prolyl CIS-TRANS isomerase mitochondrial precursor AML 29944 27060 000006ALL 3464 11062

1199095 Transmembrane protein mRNA AML 9004 7282 000003ALL 1898 5912

1199096 PGD phosphogluconate dehydrogenase AML 97052 62100 000002ALL 48091 31341

1199097 Canalicular multispecific organic anion transporter (cMOAT) AML 4256 6305 000006ALL 13185 8413

1199098 Huntingtin interacting protein (HIP1) mRNA AML minus940 12107 000000ALL minus13609 12005

1199099ME491 gene extracted from H sapiens gene for Me491CD63antigen

AML 202680 165810 000001ALL 74770 54840

11990910GB DEF = nonmuscle myosin heavy chain-B (MYH10) mRNApartial cds

AML 24312 10991 000002ALL 48681 33069

11990911P4HB procollagen-proline 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase) beta polypeptide

AML 201560 138483 000003ALL 101583 50340

32 Two-Stage Hybrid Method of MARS andVarious Classifiers

Stage 1 Substitute independent variables 1199091198941 1199091198942 119909119894119903 and

dependent variable 119910119894into multivariate adaptive regression

splines Use multivariate adaptive regression splines to iden-tify significant independent variables say 119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896

Stage 2 Substitute the significant independent variables119909lowast

1198941 119909lowast

1198942 119909lowast

119894119896obtained in Stage 1 and dependent variable

119910119894into various classifiers such as FLDA LR ANN or SVM

The corresponding hybrid methods are referred to as theMARS-FLDA MARS-LR MARS-ANN and MARS-SVMrespectively

4. The Cross-Validation Experiments

This study performs a series of cross-validation experiments to compare the proposed approaches with those previously discussed in the literature. This study considers a leukemia dataset that was first described by Golub et al. [5] and was examined in Dudoit et al. [10] and Lee et al. [22]. This dataset contains 6817 human genes and was obtained from Affymetrix high-density oligonucleotide microarrays. The data consist of 25 cases of acute myeloid leukemia (AML) and 47 cases of acute lymphoblastic leukemia (ALL).

Since the dimension of the data is very large (r = 6817) but there are only a few observations (n = 72), it is essential to reduce and refine the whole set of genes (independent variables) before we can construct the discrimination model.

Table 2: Collinearity diagnosis for LR modeling.

Variable  x1    x2    x3    x4    x5    x6    x7    x8    x9    x10   x11
VIF       2.09  2.97  2.76  2.56  1.61  3.46  1.33  1.88  6.85  1.58  5.47

To refine the set of genes, Golub et al. [5], Dudoit et al. [10], and Lee et al. [22] proposed methods based on subjective ratios to select genes. It is well known that the two-sample t-test is the most popular test for the difference in means between two groups. For the sake of rigor, instead of using a somewhat arbitrary criterion like that used in Golub et al. [5], Dudoit et al. [10], or Lee et al. [22], this study applies the two-sample t-test with a significance level of 0.0001 to select the influencing genes. The results are given in Table 1.
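As an illustration of this screening step, the short R sketch below applies the gene-wise two-sample t-test; the object names exprs and class are hypothetical, and the exact variant of the test (pooled or Welch) is not stated in the paper.

# Gene-wise two-sample t-tests; 'exprs' is a hypothetical 72 x 6817 matrix of
# expression values and 'class' a factor with levels "AML" and "ALL".
p.values <- apply(exprs, 2, function(g) t.test(g ~ class)$p.value)
selected <- which(p.values < 0.0001)   # significance level used in this study
length(selected)                       # this study retains the 11 genes of Table 1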

The significant variables selected using the two-sample t-test then serve as the input variables of the established single stage discrimination methods reviewed in Section 2 and of the proposed two-stage hybrid methods introduced in Section 3. To examine the presence of collinearity, the variance inflation factor (VIF) was calculated. As shown in Table 2, all the VIF values are less than 10; consequently, there is no high collinearity among these variables. In addition, this study adopts the suggestions of Dudoit et al. [10] and Lee et al. [22] and performs a 2:1 cross-validation (training set : test set), as sketched below.
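The split and the collinearity check can be written compactly; the sketch below assumes a data frame dat holding the 72 samples, the class label y, and the eleven selected genes, together with the car package for the VIF computation, none of which are prescribed by the paper.

# Illustrative 2:1 split (training set : test set) followed by a VIF check.
library(car)                                   # provides vif()

set.seed(2015)                                 # arbitrary seed
idx   <- sample(seq_len(nrow(dat)), size = round(2 / 3 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Variance inflation factors from a logistic regression on all eleven genes;
# values below 10 indicate no serious collinearity (cf. Table 2).
vif(glm(y ~ ., data = dat, family = binomial))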

The difficulty with ANN is that design parameters, such as the number of hidden layers and the number of neurons in each layer, have to be set before the training process can proceed. The user has to select the ANN structure and set the values of certain parameters for the ANN modeling process.


> # Find the best parameters: gamma & cost (svm() and tune.svm() are from the e1071 package)
> library(e1071)
> p <- seq(-1, 1, 1)
> obj <- tune.svm(y ~ ., data = train, gamma = 2^(p), cost = 2^(p),
+                 tunecontrol = tune.control(sampling = "cross"))
> obj
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
  gamma cost
    0.5    2

> # Building the SVM model
> svm.model <- svm(y ~ ., data = train, type = "C-classification",
+                  gamma = obj$best.parameters[[1]], cost = obj$best.parameters[[2]])
> # Classification capability: Train
> svm.pred <- predict(svm.model, train)
> tab <- table(predict = svm.pred, true = train[, 1])
> tab
       true
predict  0  1
      0 17  0
      1  0 31
> cat("Accurate Classification Rate =", 100 * sum(diag(tab)) / sum(tab), "\n")
Accurate Classification Rate = 100
> # Classification capability: Test
> svm.pred <- predict(svm.model, test)
> tab <- table(predict = svm.pred, true = test[, 1])
> tab
       true
predict  0  1
      0  2  1
      1  6 15
> cat("Accurate Classification Rate =", 100 * sum(diag(tab)) / sum(tab), "\n")
Accurate Classification Rate = 70.83333

Algorithm 1: The SVM modeling output.

However, there is no general and explicit approach to selecting optimal parameters for ANN models [49]. Accordingly, the selection of design parameters for an ANN may be based on a trial and error procedure.

This study employs the highest accurate classification rate (ACR) as the criterion for selecting the ANN topology. The topology is denoted as n_i-n_h-n_o-L, which stands for the number of neurons in the input layer, the number of neurons in the hidden layer, the number of neurons in the output layer, and the learning rate, respectively. Too few hidden nodes would limit the network's generalization capability, while too many hidden nodes may result in overtraining or memorization by the network. Since there are 11 input nodes and one output node used in this study, the numbers of hidden nodes to test were selected as 9, 10, 11, 12, and 13, and the learning rates were chosen as 0.1, 0.01, and 0.001. After performing the ANN modeling, this study found that the 11-9-1-0.01 topology has the best ACR results; a sketch of such a search appears below.
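A simple grid search over candidate topologies can automate this trial and error. The sketch below is illustrative only: it uses the nnet package, which exposes a weight-decay term rather than a learning rate, so decay is varied in place of L, and it reuses the hypothetical train/test objects assumed earlier.

library(nnet)

# ACR of a fitted classifier on a given data set.
acr <- function(model, data) {
  pred <- predict(model, data, type = "class")
  100 * mean(pred == data$y)
}

# Candidate hidden-layer sizes, with decay standing in for the learning rate.
grid <- expand.grid(size = c(9, 10, 11, 12, 13), decay = c(0.1, 0.01, 0.001))
results <- apply(grid, 1, function(g) {
  fit <- nnet(factor(y) ~ ., data = train, size = g["size"],
              decay = g["decay"], maxit = 500, trace = FALSE)
  acr(fit, test)
})
grid[which.max(results), ]   # topology with the highest ACR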

This study also applied SVM modeling to the microarray dataset. The two parameters C and γ are the most important factors affecting the performance of SVM. The grid search method uses exponentially growing sequences of C and γ to determine good parameters, and the parameter set of C and γ which generates the highest ACR is considered to be the ideal set. Here, the best two parameter values for C and γ are 2 and 0.5, respectively. The SVM routine was run on the dataset, and the corresponding output is displayed in Algorithm 1. Observing Algorithm 1, in the case of C = 2 and γ = 0.5, we obtain ACR = 100% in the initial training stage. In the testing stage, we obtain ACR = 25.00% and ACR = 93.75% for AML and ALL, respectively, using the same parameter settings (i.e., C = 2 and γ = 0.5). Accordingly, the ACR is 70.83% for the full sample.
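The per-class rates quoted above follow directly from the test-set confusion matrix in Algorithm 1; the fragment below simply recomputes them, assuming (as the class sizes suggest) that label 0 denotes AML and label 1 denotes ALL.

# Per-class ACRs from the test-set confusion matrix of Algorithm 1
# (label 0 assumed to be AML, label 1 ALL).
tab <- matrix(c(2, 6, 1, 15), nrow = 2,
              dimnames = list(predict = c(0, 1), true = c(0, 1)))
100 * tab["0", "0"] / sum(tab[, "0"])   # AML:  2 / 8  = 25.00
100 * tab["1", "1"] / sum(tab[, "1"])   # ALL: 15 / 16 = 93.75
100 * sum(diag(tab)) / sum(tab)         # full sample: 17 / 24 = 70.83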

For MARS modeling, the results are displayed in Table 3. During the selection process, four important explanatory variables were chosen, and the corresponding relative importance indicators are shown in Table 3.

Table 3: The relative importance of four explanatory variables for MARS modeling.

Function  Variable  Relative importance (%)
1         x2        100.0
2         x7        72.0
3         x8        42.7
4         x6        26.3


Table 4: ACRs (%) for the thirteen approaches using cross-validation.

Method            AML     ALL     Full sample
Single stage
  FLDA            37.50   93.75   75.00
  LR              62.50   87.50   79.17
  ANN             50.00   93.75   79.17
  SVM             25.00   93.75   70.83
  MARS            50.00   75.00   66.67
Two-stage
  LR-FLDA         62.50   81.25   75.00
  LR-ANN          50.00   93.75   79.17
  LR-SVM          75.00   81.25   79.17
  LR-MARS         62.50   93.75   83.33
  MARS-FLDA       75.00   75.00   75.00
  MARS-LR         75.00   75.00   75.00
  MARS-ANN        37.50   93.75   75.00
  MARS-SVM        62.50   87.50   79.17

As a consequence, those four important variables serve as the input variables for the hybrid modeling process. In addition, the resulting ACRs for each model are listed in Table 4.

The rationale behind the proposed hybrid discrimination methods is to obtain fewer but more informative variables by performing the first stage LR or MARS modeling. The selected significant variables then serve as the inputs for the second stage discrimination approach. In this study, the significant variables selected by the LR and MARS modeling are variables x1, x2, x7, and x8 and variables x2, x6, x7, and x8, respectively. For the hybrid LR-ANN model, the 4-6-1-0.01 topology provided the best ACR results; for the MARS-ANN hybrid model, the 4-6-1-0.01 topology also gave the best ACR results. Additionally, for both LR-SVM and MARS-SVM modeling, the best two parameter values for C and γ are the same, namely 2 and 0.5, respectively.

For each of the thirteen approaches, FLDA, LR, ANN, SVM, MARS, LR-FLDA, LR-ANN, LR-SVM, LR-MARS, MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, this study presents the corresponding ACRs in Table 4. Comparing the ACRs for AML, the LR model has the highest ACR (i.e., 62.50%) among the 5 single stage methods, while both LR-SVM and MARS-LR have the highest ACR (i.e., 75.00%) among the 8 two-stage methods; apparently, the two-stage methods provide better classification performance. Comparing the ACRs for ALL, the single stage methods FLDA, ANN, and SVM give the highest ACR (i.e., 93.75%), and the two-stage methods LR-ANN, LR-MARS, and MARS-ANN attain the same ACR (i.e., 93.75%), so the single stage and two-stage methods achieve a similar performance here. As shown in Table 4, among the thirteen methods, the two-stage hybrid model LR-MARS has the highest ACR (i.e., 83.33%) for the full sample. As a consequence, the proposed two-stage hybrid approaches are more efficient for tumor classification than the established single stage methods.

Table 5: Overall averaged ACR (%) and the associated standard error (in parentheses) for single stage and two-stage methods.

Method         AML             ALL            Full sample
Single stage   45.00 (14.25)   88.75 (8.15)   74.17 (5.43)
Two-stage      62.50 (13.36)   85.16 (8.14)   77.61 (3.10)

Figure 1: Improvement of the proposed approach in comparison with the single stage method (bar chart of the improvement percentage for AML, ALL, and the full sample).

In addition, Table 5 lists the overall averaged ACRs and the associated standard errors (in parentheses) for the single stage and two-stage methods. Comparing the single stage methods with the proposed two-stage methods in Table 5, one can observe that the proposed methods provide more accurate results in almost every case. Although the single stage methods have a larger averaged ACR than the two-stage methods in classifying ALL, the difference is small. Moreover, Table 5 shows that the proposed two-stage approaches have smaller standard errors in all cases, which implies the robustness of the mechanism. Figure 1 provides a comparison with respect to the overall percentage improvement over the single stage method; from Figure 1, it can be seen that the two-stage approaches are more robust than the single stage method.

5. Conclusions

This study proposes several two-stage hybrid discrimination approaches for tumor classification using microarray data. The proposed approaches integrate the framework of several frequently used statistical-based discrimination methods and computational intelligence classification techniques. Based on the cross-validation results in Table 4, it can be observed that the proposed hybrid method LR-MARS is the most appropriate for discriminating the tumor classes.

Computational intelligence methodology is useful in many areas of application and can deal with complex and computationally intensive problems. With the use of several computational intelligence techniques, this study develops


two-stage hybrid discrimination approaches for tumor classification. The proposed hybrid models are not the only discrimination methods that can be employed, and further research can build on our work. For example, one can combine other computational intelligence techniques, such as rough set theory [50] or extreme learning machines, with neural networks or support vector machines to refine the structure further and improve the classification accuracy. Extensions of the proposed two-stage hybrid discrimination method to other statistical techniques or to multistage discrimination procedures are also possible. Such works deserve further research and are our future concern.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is partially supported by the Ministry of Science and Technology of China, Grant no. MOST 103-2118-M-030-001 and Grant no. MOST 103-2221-E-030-021.

References

[1] J. L. DeRisi, V. R. Iyer, and P. O. Brown, "Exploring the metabolic and genetic control of gene expression on a genomic scale," Science, vol. 278, no. 5338, pp. 680–686, 1997.

[2] R. J. Cho, M. J. Campbell, E. A. Winzeler et al., "A genome-wide transcriptional analysis of the mitotic cell cycle," Molecular Cell, vol. 2, no. 1, pp. 65–73, 1998.

[3] S. Chu, J. DeRisi, M. Eisen et al., "The transcriptional program of sporulation in budding yeast," Science, vol. 282, no. 5389, pp. 699–705, 1998.

[4] U. Alon, N. Barkai, D. A. Notterman et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.

[5] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[6] C. M. Perou, S. S. Jeffrey, M. van de Rijn et al., "Distinctive gene expression patterns in human mammary epithelial cells and breast cancers," Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 16, pp. 9212–9217, 1999.

[7] J. R. Pollack, C. M. Perou, A. A. Alizadeh et al., "Genome-wide analysis of DNA copy-number changes using cDNA microarrays," Nature Genetics, vol. 23, no. 1, pp. 41–46, 1999.

[8] A. A. Alizadeh, M. B. Eisen, R. E. Davis et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, no. 6769, pp. 503–511, 2000.

[9] S. Ramaswamy, P. Tamayo, R. Rifkin et al., "Multiclass cancer diagnosis using tumor gene expression signatures," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 26, pp. 15149–15154, 2001.

[10] S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, vol. 97, no. 457, pp. 77–87, 2002.

[11] J. J. Liu, G. Cutler, W. Li et al., "Multiclass cancer classification and biomarker discovery using GA-based algorithms," Bioinformatics, vol. 21, no. 11, pp. 2691–2697, 2005.

[12] L. Ziaei, A. R. Mehri, and M. Salehi, "Application of artificial neural networks in cancer classification and diagnosis prediction of a subtype of lymphoma based on gene expression profile," Journal of Research in Medical Sciences, vol. 11, no. 1, pp. 13–17, 2006.

[13] Z. Wang, Y. Wang, J. Xuan et al., "Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data," Bioinformatics, vol. 22, no. 6, pp. 755–761, 2006.

[14] K. V. G. Rao, P. P. Chand, and M. V. R. Murthy, "A neural network approach in medical decision systems," Journal of Theoretical and Applied Information Technology, vol. 3, pp. 97–101, 2007.

[15] H. Rattikorn and K. Phongphun, "Tumor classification ranking from microarray data," BMC Genomics, vol. 9, no. 2, article S21, 2008.

[16] L.-T. Huang, "An integrated method for cancer classification and rule extraction from microarray data," Journal of Biomedical Science, vol. 16, article 25, 10 pages, 2009.

[17] T. Y. Yang, "Efficient multi-class cancer diagnosis algorithm using a global similarity pattern," Computational Statistics & Data Analysis, vol. 53, no. 3, pp. 756–765, 2009.

[18] H. Pang, K. Ebisu, E. Watanabe, L. Y. Sue, and T. Tong, "Analysing breast cancer microarrays from African Americans using shrinkage-based discriminant analysis," Human Genomics, vol. 5, no. 1, pp. 5–16, 2010.

[19] N. B. Dawany, W. N. Dampier, and A. Tozeren, "Large-scale integration of microarray data reveals genes and pathways common to multiple cancer types," International Journal of Cancer, vol. 128, no. 12, pp. 2881–2891, 2011.

[20] R. Pillai, R. Deeter, C. T. Rigl et al., "Validation and reproducibility of a microarray-based gene expression test for tumor identification in formalin-fixed, paraffin-embedded specimens," The Journal of Molecular Diagnostics, vol. 13, no. 1, pp. 48–56, 2011.

[21] D. Ghosh, "Penalized discriminant methods for the classification of tumors from gene expression data," Biometrics, vol. 59, no. 4, pp. 992–1000, 2003.

[22] J. W. Lee, J. B. Lee, M. Park, and S. H. Song, "An extensive comparison of recent classification tools applied to microarray data," Computational Statistics & Data Analysis, vol. 48, no. 4, pp. 869–885, 2005.

[23] M. S. Srivastava and T. Kubokawa, "Comparison of discrimination methods for high dimensional data," Journal of the Japan Statistical Society, vol. 37, no. 1, pp. 123–134, 2007.

[24] R. Bermudo, D. Abia, A. Mozos et al., "Highly sensitive molecular diagnosis of prostate cancer using surplus material washed off from biopsy needles," British Journal of Cancer, vol. 105, no. 10, pp. 1600–1607, 2011.

[25] W. Li, F. Sun, and I. Grosse, "Extreme value distribution based gene selection criteria for discriminant microarray data analysis using logistic regression," Journal of Computational Biology, vol. 11, no. 2-3, pp. 215–226, 2004.

[26] J. G. Liao and K. V. Chin, "Logistic regression for disease classification using microarray data: model selection in a large p and small n case," Bioinformatics, vol. 23, no. 15, pp. 1945–1951, 2007.

[27] S. Gruvberger, M. Ringner, Y. Chen et al., "Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns," Cancer Research, vol. 61, no. 16, pp. 5979–5984, 2001.

[28] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.

[29] L. J. Lancashire, C. Lemetre, and G. R. Ball, "An introduction to artificial neural networks in bioinformatics—application to complex microarray and mass spectrometry datasets in cancer studies," Briefings in Bioinformatics, vol. 10, no. 3, pp. 315–329, 2009.

[30] M. R. Segal, K. D. Dahlquist, and B. R. Conklin, "Regression approaches for microarray data analysis," Journal of Computational Biology, vol. 10, no. 6, pp. 961–980, 2003.

[31] A. Dragomir and A. Bezerianos, "Improving gene expression sample classification using support vector machine ensembles aggregated by boosting," Cancer Genomics & Proteomics, vol. 3, no. 1, pp. 63–70, 2006.

[32] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, "Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485–494, 2007.

[33] X. Wang and R. Simon, "Microarray-based cancer prediction using single genes," BMC Bioinformatics, vol. 12, article 391, 2011.

[34] K. Y. Chan, C. K. Kwong, and Y. C. Tsim, "Modelling and optimization of fluid dispensing for electronic packaging using neural fuzzy networks and genetic algorithms," Engineering Applications of Artificial Intelligence, vol. 23, no. 1, pp. 18–26, 2010.

[35] K. Y. Chan, T. S. Dillon, and C. K. Kwong, "Modeling of a liquid epoxy molding process using a particle swarm optimization-based fuzzy regression approach," IEEE Transactions on Industrial Informatics, vol. 7, no. 1, pp. 148–158, 2011.

[36] Y. E. Shao and C.-D. Hou, "Change point determination for a multivariate process using a two-stage hybrid scheme," Applied Soft Computing, vol. 13, no. 3, pp. 1520–1527, 2013.

[37] D. E. Rumelhart and J. L. McClelland, Explorations in the Microstructure of Cognition, vol. 1, MIT Press, 1986.

[38] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.

[39] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[40] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[41] S. L. Lin, C. D. Hou, and P. H. Gi, "Do the two-stage hybrid models outperform the conventional techniques? Evidence in Taiwan," International Journal of Business and Strategy, vol. 9, pp. 98–131, 2008.

[42] S. L. Lin, "A new two-stage hybrid approach of credit risk in banking industry," Expert Systems with Applications, vol. 36, no. 4, pp. 8333–8341, 2009.

[43] S.-M. Chou, T.-S. Lee, Y. E. Shao, and I.-F. Chen, "Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines," Expert Systems with Applications, vol. 27, no. 1, pp. 133–142, 2004.

[44] R. Modarres and J. L. Gastwirth, "Hybrid test for the hypothesis of symmetry," Journal of Applied Statistics, vol. 25, no. 6, pp. 777–783, 1998.

[45] R. Tang, M. Banerjee, and G. Michailidis, "A two-stage hybrid procedure for estimating an inverse regression function," The Annals of Statistics, vol. 39, no. 2, pp. 956–989, 2011.

[46] W. Bischoff and F. Miller, "A minimax two-stage procedure for comparing treatments looking at a hybrid test and estimation problem as a whole," Statistica Sinica, vol. 12, no. 4, pp. 1133–1144, 2002.

[47] C.-J. Lu, Y. E. Shao, and P.-H. Li, "Mixture control chart patterns recognition using independent component analysis and support vector machine," Neurocomputing, vol. 74, no. 11, pp. 1908–1914, 2011.

[48] Y. E. Shao and C. D. Hou, "Fault identification in industrial processes using an integrated approach of neural network and analysis of variance," Mathematical Problems in Engineering, vol. 2013, Article ID 516760, 7 pages, 2013.

[49] Z. W. Zhong, L. P. Khoo, and S. T. Han, "Prediction of surface roughness of turned surfaces using neural networks," International Journal of Advanced Manufacturing Technology, vol. 28, no. 7-8, pp. 688–693, 2006.

[50] Y. E. Shao, C.-D. Hou, and C.-C. Chiu, "Hybrid intelligent modeling schemes for heart disease classification," Applied Soft Computing Journal, vol. 14, pp. 47–52, 2014.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article Integrated Use of Statistical-Based ...downloads.hindawi.com/journals/ddns/2015/261013.pdfIntegrated Use of Statistical-Based Approaches and Computational Intelligence

Discrete Dynamics in Nature and Society 5

gt Find the best parameter gammaampcostgt plt-seq(-111)gt objlt-tunesvm(ysim data=train sampling=cross gamma=2and(p) cost=2and(p))gt objParameter tuning of lsquosvmrsquo- sampling method 10-fold cross validation- best parametersgamma cost05 2

gt Building the SVMmodelgt svmmodellt-svm(ysim data=train type=C-classification gamma=obj$bestparameters[[1]] cost=obj$bestparameters[[2]])gt Classification capability Traingt svmpredlt-predict(svmmodel train)gt tablt-table(predict=svmpred true=train[1])gt tab

truepredict 0 1

0 17 01 0 31

gt cat(Accurate Classification Rate = 100lowastsum(dig(tab))sum(tab) n)Accurate Classification Rate = 100 gt Classification capability Testgt svmpredlt-predict(svmmodel test)gt tablt-table(predict=svmpred true=test[1])gt tab

truepredict 0 1

0 2 11 6 15

gt cat(Accurate Classification Rate = 100lowastsum(dig(tab))sum(tab) n)Accurate Classification Rate = 7083333

Algorithm 1 The SVMmodeling output

However there is no general and explicit approach to selectoptimal parameters for the ANN models [49] Accordinglythe selection of design parameters for ANNmay be based onthe trial and error procedure

This study employs the highest accurate classificationrate (ACR) as the criterion for selecting the ANN topologyThe topology is defined as 119899

119894-119899ℎ-119899119900-119871 where it stands

for the number of neurons in the input layer number ofneurons in the hidden layer number of neurons in theoutput layer and learning rate respectively Actually too fewhidden nodes would limit the network generation capabilitywhile too many hidden nodes may result in overtraining ormemorization by the network Since there are 11 input nodesand one output node used in this study the numbers ofhidden nodes to test were selected as 9 10 11 12 and 13 Thelearning rates are chosen as 01 001 or 0001 respectivelyAfter performing the ANN modeling this study found thatthe 11-9-1-001 topology has the best ACR results

This study also performed the SVM modeling to themicroarray dataset The two parameters 119862 and 120574 are themost important factors to affect the performance of SVMThegrid search method uses exponentially growing sequences of119862 and 120574 to determine good parameters The parameter setof 119862 and 120574 which generates the highest ACR is considered

Table 3 The relative importance of four explanatory variables forMARS modelling

Function Variable Relative importance ()1 119909

21000

2 1199097

7203 119909

8427

4 1199096

263

to be ideal set Here the best two parameter values for 119862

and 120574 are 2 and 05 respectively The SVM package wasperformed in running the dataset and the correspondingoutput is displayed in Algorithm 1 Observing Algorithm 1 inthe case of 119862 = 2 and 120574 = 05 we can have ACR = 100 forthe initial training stage Consequently in the testing stagewe are able to obtain ACR = 25 and ACR = 9375 for AMLand ALL respectively by using the same parameter settings(ie 119862 = 2 and 120574 = 05) Accordingly the ACR = 7083 forthe case of full sample

For MARS modeling the results are displayed in Table 3During the selection process four important explanatoryvariables were chosen The corresponding relative impor-tance indicators are showed in Table 3 As a consequence

6 Discrete Dynamics in Nature and Society

Table 4 ACRs for thirteen approaches using cross-validation

Method ACRAML ALL Full sample

Single stageFLDA 3750 9375 7500LR 6250 8750 7917ANN 5000 9375 7917SVM 2500 9375 7083MARS 5000 7500 6667

Two-stageLR-FLDA 6250 8125 7500LR-ANN 5000 9375 7917LR-SVM 7500 8125 7917LR-MARS 6250 9375 8333MARS-FLDA 7500 7500 7500MARS-LR 7500 7500 7500MARS-ANN 3750 9375 7500MARS-SVM 6250 8750 7917

those four important variables would be served as the inputvariables for hybridmodeling process In addition the resultsof ACR for each modeling are listed in Table 4

The rationale behind the proposed hybrid discriminationmethod is to obtain the fewer but more informative variablesby performing the first stage LR or MARS modeling Theselected significant variables are then served as the inputs forthe second stage of discrimination approach In this study thesignificant variables selected by performing LR and MARSmodeling are variables 1199091 1199092 1199097 and 1199098 and variables 1199092 11990961199097 and 1199098 respectively For the hybrid LR-ANN model the4-6-1-001 topology provided the best ACR results For theMARS-ANN hybrid model the 4-6-1-001 topology alsogave the best ACR results Additionally for both LR-SVMandMARS-SVM modeling the best two parameter values for 119862

and 120574 are the same and they are 2 and 05 respectivelyFor each of the thirteen different approaches FLDA

LR ANN SVM MARS LR-FLDA LR-ANN LR-SVM LR-MARS MARS-FLDA MARS-LR MARS-ANN and MARS-SVM this study presents the corresponding ACRs in Table 4By comparing the ACR with AML while the LR has highestACR (ie 6250) among the 5 single stage methods bothLR-SVM and MARS-LR have the highest ACR (ie 7500)among the 8 two-stage methods Apparently the two-stagemethods provide a better classification performance Bycomparing the ACR with ALL the single stage methods ofFLDA ANN and SVM give the highest ACR (ie 9375)and the two-stage methods of LR-ANN LR-MARS andMARS-ANN have the same ACR (ie 9375) It seems thatthe single stage and two-stage methods achieve a similarperformance As shown in Table 4 it can be seen that amongthe thirteen methods mentioned above the two-stage hybridmodel of LR-MARShas the highestACRs (ie 8333) for thefull sample As a consequence the proposed two-stage hybridapproaches aremore efficient for tumor classification than theestablished single stage methods

Table 5 Overall averaged ACR and the associated standard error(in parentheses) for single stage and two-stage methods

MethodACR

AML ALL Full sample

Single stage 4500(1425)

8875(815)

7417(543)

Two-stage 6250(1336)

8516(814)

7761(310)

012625

4291

05

101520253035404550

ALL FullAMLIm

prov

emen

t (

)

Figure 1 Improvement of the proposed approach in comparisonwith the single stage method

In addition Table 5 lists the overall averaged ACRs andthe associated standard errors (in parentheses) for singlestage and two-stage methods In comparison to the singlestage and the proposed two-stage methods in Table 5one is able to observe that our proposed methods almostprovide more accurate results than the single stage methodsAlthough the single stage methods have larger averagedACR value than two-stage methods in classifying ALL thedifference is not too significant In addition observing Table 5it can be found that the proposed two-stage approaches havethe smaller standard errors for all the cases which imply therobustness of the mechanisms Figure 1 provides a compar-ison with respect to the overall improvement percentage inthe single stage method From Figure 1 it can be seen that thetwo-stage approaches are more robust than the single stagemethod

5 Conclusions

This study proposes several two-stage hybrid discriminationapproaches for tumor classification using microarray dataThe proposed approaches integrate the framework of severalfrequently used statistical-based discriminationmethods andcomputational intelligence classifying techniques Based onthe results of cross-validation in Table 4 it can be easilyobserved that the proposed hybridmethodLR-MARS ismoreappropriate for discriminating the tumor classes

Computational intelligencemethodology is very useful inmany aspects of application and can deal with complex andcomputationally intensive problems With the use of severalcomputational intelligence techniques this study develops

Discrete Dynamics in Nature and Society 7

two-stage hybrid discrimination approach for tumor classi-ficationThe proposed hybrid model is not the only discrimi-nationmethod that can be employed Based on our work fur-ther research can be expanded For example one can combineother computational intelligence techniques such as roughset theory [50] or extreme learning machine with neuralnetworks or support vector machine to refine the structurefurther and improve the classification accuracy Extensionsof the proposed two-stage hybrid discrimination method toother statistical techniques or to multistage discriminationprocedures are also possible Such works deserve furtherresearch and are our future concern

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work is partially supported by the Ministry of Scienceand Technology of China Grant no MOST 103-2118-M-030-001 and Grant no MOST 103-2221-E-030-021

References

[1] J LDeRisi V R Iyer andPO Brown ldquoExploring themetabolicand genetic control of gene expression on a genomic scalerdquoScience vol 278 no 5338 pp 680ndash686 1997

[2] R J Cho M J Campbell E AWinzeler et al ldquoA genome-widetranscriptional analysis of the mitotic cell cyclerdquoMolecular Cellvol 2 no 1 pp 65ndash73 1998

[3] S Chu J DeRisi M Eisen et al ldquoThe transcriptional programof sporulation in budding yeastrdquo Science vol 282 no 5389 pp699ndash705 1998

[4] U Alon N Barka D A Notterman et al ldquoBroad patternsof gene expression revealed by clustering analysis of tumorand normal colon tissues probed by oligonucleotide arraysrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 96 no 12 pp 6745ndash6750 1999

[5] T R Golub D K Slonim P Tamayo et al ldquoMolecularclassification of cancer class discovery and class prediction bygene expressionmonitoringrdquo Science vol 286 no 5439 pp 531ndash527 1999

[6] C M Perou S S Jeffrey M van de Rijn et al ldquoDistinctive geneexpression patterns in human mammary epithelial cells andbreast cancersrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 96 no 16 pp 9212ndash92171999

[7] J R Pollack C M Perou A A Alizadeh et al ldquoGenome-wide analysis of DNA copy-number changes using cDNAmicroarraysrdquo Nature Genetics vol 23 no 1 pp 41ndash46 1999

[8] A A Alizadeh M B Elsen R E Davis et al ldquoDistinct typesof diffuse large B-cell lymphoma identified by gene expressionprofilingrdquo Nature vol 403 no 6769 pp 503ndash511 2000

[9] S Ramaswamy P Tamayo R Rifkin et al ldquoMulticlass cancerdiagnosis using tumor gene expression signaturesrdquo Proceedingsof the National Academy of Sciences of the United States ofAmerica vol 98 no 26 pp 15149ndash15154 2001

[10] S Dudoit J Fridlyand and T P Speed ldquoComparison of dis-criminationmethods for the classification of tumors using geneexpression datardquo Journal of the American Statistical Associationvol 97 no 457 pp 77ndash87 2002

[11] J J Liu G Cutler W Li et al ldquoMulticlass cancer classificationand biomarker discovery using GA-based algorithmsrdquo Bioinfor-matics vol 21 no 11 pp 2691ndash2697 2005

[12] L Ziaei A R Mehri and M Salehi ldquoApplication of artificialneural networks in cancer classification and diagnosis pre-diction of a subtype of lymphoma based on gene expressionprofilerdquo Journal of Research in Medical Sciences vol 11 no 1 pp13ndash17 2006

[13] Z Wang Y Wang J Xuan et al ldquoOptimized multilayerperceptrons for molecular classification and diagnosis usinggenomic datardquo Bioinformatics vol 22 no 6 pp 755ndash761 2006

[14] K V G Rao P P Chand and M V R Murthy ldquoA neuralnetwork approach in medical decision systemsrdquo Journal ofTheoretical and Applied Information Technology vol 3 pp 97ndash101 2007

[15] H Rattikorn and K Phongphun ldquoTumor classification rankingfrom microarray datardquo BMC Genomics vol 9 no 2 article S212008

[16] L-T Huang ldquoAn integrated method for cancer classificationand rule extraction frommicroarray datardquo Journal of BiomedicalScience vol 16 article 25 10 pages 2009

[17] T Y Yang ldquoEfficient multi-class cancer diagnosis algorithmusing a global similarity patternrdquo Computational Statistics ampData Analysis vol 53 no 3 pp 756ndash765 2009

[18] H Pang K Ebisu E Watanabe L Y Sue and T TongldquoAnalysing breast cancer microarrays from African Amer-icans using shrinkage-based discriminant analysisrdquo HumanGenomics vol 5 no 1 pp 5ndash16 2010

[19] N B Dawany W N Dampier and A Tozeren ldquoLarge-scaleintegration of microarray data reveals genes and pathwayscommon to multiple cancer typesrdquo International Journal ofCancer vol 128 no 12 pp 2881ndash2891 2011

[20] R Pillai RDeeter C T Rigl et al ldquoValidation and reproducibil-ity of a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensrdquo TheJournal of Molecular Diagnostics vol 13 no 1 pp 48ndash56 2011

[21] D Ghosh ldquoPenalized discriminant methods for the classifica-tion of tumors from gene expression datardquo Biometrics vol 59no 4 pp 992ndash1000 2003

[22] J W Lee J B Lee M Park and S H Song ldquoAn extensivecomparison of recent classification tools applied to microarraydatardquo Computational Statistics amp Data Analysis vol 48 no 4pp 869ndash885 2005

[23] M S Srivastava and T Kubokawa ldquoComparison of discrimina-tion methods for high dimensional datardquo Journal of the JapanStatistical Society vol 37 no 1 pp 123ndash134 2007

[24] R Bermudo D Abia A Mozos et al ldquoHighly sensitivemolecular diagnosis of prostate cancer using surplus materialwashed off from biopsy needlesrdquo British Journal of Cancer vol105 no 10 pp 1600ndash1607 2011

[25] W Li F Sun and I Grosse ldquoExtreme value distribution basedgene selection criteria for discriminantmicroarray data analysisusing logistic regressionrdquo Journal of Computational Biology vol11 no 2-3 pp 215ndash226 2004

[26] J G Liao and K V Chin ldquoLogistic regression for diseaseclassification using microarray data model selection in a largep and small n caserdquo Bioinformatics vol 23 no 15 pp 1945ndash19512007

8 Discrete Dynamics in Nature and Society

[27] S Gruvberger M Ringner Y Chen et al ldquoEstrogen receptorstatus in breast cancer is associated with remarkably distinctgene expression patternsrdquo Cancer Research vol 61 no 16 pp5979ndash5984 2001

[28] J Khan J S Wei M Ringner et al ldquoClassification anddiagnostic prediction of cancers using gene expression profilingand artificial neural networksrdquoNatureMedicine vol 7 no 6 pp673ndash679 2001

[29] L J Lancashire C Lemetre and G R Ball ldquoAn introductionto artificial neural networks in bioinformaticsmdashapplication tocomplex microarray and mass spectrometry datasets in cancerstudiesrdquo Briefings in Bioinformatics vol 10 no 3 pp 315ndash3292009

[30] M R Segal K D Dahlquist and B R Conklin ldquoRegressionapproaches for microarray data analysisrdquo Journal of Computa-tional Biology vol 10 no 6 pp 961ndash980 2003

[31] A Dragomir and A Bezerianos ldquoImproving gene expressionsample classification using support vector machine ensemblesaggregated by boostingrdquo Cancer Genomics amp Proteomics vol 3no 1 pp 63ndash70 2006

[32] R Zhang G-B Huang N Sundararajan and P Saratchan-dran ldquoMulticategory classification using an extreme learningmachine for microarray gene expression cancer diagnosisrdquoIEEEACM Transactions on Computational Biology and Bioin-formatics vol 4 no 3 pp 485ndash494 2007

[33] X Wang and R Simon ldquoMicroarray-based cancer predictionusing single genesrdquoBMCBioinformatics vol 12 article 391 2011

[34] K Y Chan C K Kwong and Y C Tsim ldquoModelling andoptimization of fluid dispensing for electronic packaging usingneural fuzzy networks and genetic algorithmsrdquo EngineeringApplications of Artificial Intelligence vol 23 no 1 pp 18ndash262010

[35] K Y Chan T S Dillon and C K Kwong ldquoModeling of a liquidepoxy molding process using a particle swarm optimization-based fuzzy regression approachrdquo IEEE Transactions on Indus-trial Informatics vol 7 no 1 pp 148ndash158 2011

[36] Y E Shao and C-D Hou ldquoChange point determination for amultivariate process using a two-stage hybrid schemerdquo AppliedSoft Computing vol 13 no 3 pp 1520ndash1527 2013

[37] D E Rumelhart and J L McClelland Explorations in theMicrostructure of Cognition 1 MIT Press 1986

[38] VN VapnikTheNature of Statistical LearningTheory SpringerBerlin Germany 2000

[39] V Cherkassky and Y Ma ldquoPractical selection of SVM parame-ters and noise estimation for SVM regressionrdquoNeural Networksvol 17 no 1 pp 113ndash126 2004

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] S L Lin C D Hou and P H Gi ldquoDo the two-stage hybridmodels outperform the conventional techniques Evidence inTaiwanrdquo International Journal of Business and Strategy vol 9pp 98ndash131 2008

[42] S L Lin ldquoA new two-stage hybrid approach of credit risk inbanking industryrdquo Expert Systems with Applications vol 36 no4 pp 8333ndash8341 2009

[43] S-M Chou T-S Lee Y E Shao and I-F Chen ldquoMiningthe breast cancer pattern using artificial neural networks andmultivariate adaptive regression splinesrdquo Expert Systems withApplications vol 27 no 1 pp 133ndash142 2004

[44] R Modarres and J L Gastwirth ldquoHybrid test for the hypothesisof symmetryrdquo Journal of Applied Statistics vol 25 no 6 pp 777ndash783 1998

[45] R Tang M Banerjee and G Michailidis ldquoA two-stage hybridprocedure for estimating an inverse regression functionrdquo TheAnnals of Statistics vol 39 no 2 pp 956ndash989 2011

[46] W Bischoff and F Miller ldquoA minimax two-stage procedure forcomparing treatments looking at a hybrid test and estimationproblemas awholerdquo Statistica Sinica vol 12 no 4 pp 1133ndash11442002

[47] C-J Lu Y E Shao and P-H Li ldquoMixture control chartpatterns recognition using independent component analysisand support vector machinerdquo Neurocomputing vol 74 no 11pp 1908ndash1914 2011

[48] Y E Shao and C D Hou ldquoFault identification in industrialprocesses using an integrated approach of neural network andanalysis of variancerdquoMathematical Problems in Engineering vol2013 Article ID 516760 7 pages 2013

[49] Z W Zhong L P Khoo and S T Han ldquoPrediction ofsurface roughness of turned surfaces using neural networksrdquoInternational Journal of Advanced Manufacturing Technologyvol 28 no 7-8 pp 688ndash693 2006

[50] Y E Shao C-D Hou and C-C Chiu ldquoHybrid intelligentmodeling schemes for heart disease classificationrdquo Applied SoftComputing Journal vol 14 pp 47ndash52 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article Integrated Use of Statistical-Based ...downloads.hindawi.com/journals/ddns/2015/261013.pdfIntegrated Use of Statistical-Based Approaches and Computational Intelligence

6 Discrete Dynamics in Nature and Society

Table 4 ACRs for thirteen approaches using cross-validation

Method ACRAML ALL Full sample

Single stageFLDA 3750 9375 7500LR 6250 8750 7917ANN 5000 9375 7917SVM 2500 9375 7083MARS 5000 7500 6667

Two-stageLR-FLDA 6250 8125 7500LR-ANN 5000 9375 7917LR-SVM 7500 8125 7917LR-MARS 6250 9375 8333MARS-FLDA 7500 7500 7500MARS-LR 7500 7500 7500MARS-ANN 3750 9375 7500MARS-SVM 6250 8750 7917

those four important variables would be served as the inputvariables for hybridmodeling process In addition the resultsof ACR for each modeling are listed in Table 4

The rationale behind the proposed hybrid discriminationmethod is to obtain the fewer but more informative variablesby performing the first stage LR or MARS modeling Theselected significant variables are then served as the inputs forthe second stage of discrimination approach In this study thesignificant variables selected by performing LR and MARSmodeling are variables 1199091 1199092 1199097 and 1199098 and variables 1199092 11990961199097 and 1199098 respectively For the hybrid LR-ANN model the4-6-1-001 topology provided the best ACR results For theMARS-ANN hybrid model the 4-6-1-001 topology alsogave the best ACR results Additionally for both LR-SVMandMARS-SVM modeling the best two parameter values for 119862

and 120574 are the same and they are 2 and 05 respectivelyFor each of the thirteen different approaches FLDA

LR ANN SVM MARS LR-FLDA LR-ANN LR-SVM LR-MARS MARS-FLDA MARS-LR MARS-ANN and MARS-SVM this study presents the corresponding ACRs in Table 4By comparing the ACR with AML while the LR has highestACR (ie 6250) among the 5 single stage methods bothLR-SVM and MARS-LR have the highest ACR (ie 7500)among the 8 two-stage methods Apparently the two-stagemethods provide a better classification performance Bycomparing the ACR with ALL the single stage methods ofFLDA ANN and SVM give the highest ACR (ie 9375)and the two-stage methods of LR-ANN LR-MARS andMARS-ANN have the same ACR (ie 9375) It seems thatthe single stage and two-stage methods achieve a similarperformance As shown in Table 4 it can be seen that amongthe thirteen methods mentioned above the two-stage hybridmodel of LR-MARShas the highestACRs (ie 8333) for thefull sample As a consequence the proposed two-stage hybridapproaches aremore efficient for tumor classification than theestablished single stage methods

Table 5 Overall averaged ACR and the associated standard error(in parentheses) for single stage and two-stage methods

MethodACR

AML ALL Full sample

Single stage 4500(1425)

8875(815)

7417(543)

Two-stage 6250(1336)

8516(814)

7761(310)

012625

4291

05

101520253035404550

ALL FullAMLIm

prov

emen

t (

)

Figure 1 Improvement of the proposed approach in comparisonwith the single stage method

In addition Table 5 lists the overall averaged ACRs andthe associated standard errors (in parentheses) for singlestage and two-stage methods In comparison to the singlestage and the proposed two-stage methods in Table 5one is able to observe that our proposed methods almostprovide more accurate results than the single stage methodsAlthough the single stage methods have larger averagedACR value than two-stage methods in classifying ALL thedifference is not too significant In addition observing Table 5it can be found that the proposed two-stage approaches havethe smaller standard errors for all the cases which imply therobustness of the mechanisms Figure 1 provides a compar-ison with respect to the overall improvement percentage inthe single stage method From Figure 1 it can be seen that thetwo-stage approaches are more robust than the single stagemethod

5 Conclusions

This study proposes several two-stage hybrid discriminationapproaches for tumor classification using microarray dataThe proposed approaches integrate the framework of severalfrequently used statistical-based discriminationmethods andcomputational intelligence classifying techniques Based onthe results of cross-validation in Table 4 it can be easilyobserved that the proposed hybridmethodLR-MARS ismoreappropriate for discriminating the tumor classes

Computational intelligencemethodology is very useful inmany aspects of application and can deal with complex andcomputationally intensive problems With the use of severalcomputational intelligence techniques this study develops

Discrete Dynamics in Nature and Society 7

two-stage hybrid discrimination approach for tumor classi-ficationThe proposed hybrid model is not the only discrimi-nationmethod that can be employed Based on our work fur-ther research can be expanded For example one can combineother computational intelligence techniques such as roughset theory [50] or extreme learning machine with neuralnetworks or support vector machine to refine the structurefurther and improve the classification accuracy Extensionsof the proposed two-stage hybrid discrimination method toother statistical techniques or to multistage discriminationprocedures are also possible Such works deserve furtherresearch and are our future concern

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work is partially supported by the Ministry of Scienceand Technology of China Grant no MOST 103-2118-M-030-001 and Grant no MOST 103-2221-E-030-021

References

[1] J LDeRisi V R Iyer andPO Brown ldquoExploring themetabolicand genetic control of gene expression on a genomic scalerdquoScience vol 278 no 5338 pp 680ndash686 1997

[2] R J Cho M J Campbell E AWinzeler et al ldquoA genome-widetranscriptional analysis of the mitotic cell cyclerdquoMolecular Cellvol 2 no 1 pp 65ndash73 1998

[3] S Chu J DeRisi M Eisen et al ldquoThe transcriptional programof sporulation in budding yeastrdquo Science vol 282 no 5389 pp699ndash705 1998

[4] U Alon N Barka D A Notterman et al ldquoBroad patternsof gene expression revealed by clustering analysis of tumorand normal colon tissues probed by oligonucleotide arraysrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 96 no 12 pp 6745ndash6750 1999

[5] T R Golub D K Slonim P Tamayo et al ldquoMolecularclassification of cancer class discovery and class prediction bygene expressionmonitoringrdquo Science vol 286 no 5439 pp 531ndash527 1999

[6] C M Perou S S Jeffrey M van de Rijn et al ldquoDistinctive geneexpression patterns in human mammary epithelial cells andbreast cancersrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 96 no 16 pp 9212ndash92171999

[7] J R Pollack C M Perou A A Alizadeh et al ldquoGenome-wide analysis of DNA copy-number changes using cDNAmicroarraysrdquo Nature Genetics vol 23 no 1 pp 41ndash46 1999

[8] A A Alizadeh M B Elsen R E Davis et al ldquoDistinct typesof diffuse large B-cell lymphoma identified by gene expressionprofilingrdquo Nature vol 403 no 6769 pp 503ndash511 2000

[9] S Ramaswamy P Tamayo R Rifkin et al ldquoMulticlass cancerdiagnosis using tumor gene expression signaturesrdquo Proceedingsof the National Academy of Sciences of the United States ofAmerica vol 98 no 26 pp 15149ndash15154 2001

[10] S Dudoit J Fridlyand and T P Speed ldquoComparison of dis-criminationmethods for the classification of tumors using geneexpression datardquo Journal of the American Statistical Associationvol 97 no 457 pp 77ndash87 2002

[11] J J Liu G Cutler W Li et al ldquoMulticlass cancer classificationand biomarker discovery using GA-based algorithmsrdquo Bioinfor-matics vol 21 no 11 pp 2691ndash2697 2005

[12] L Ziaei A R Mehri and M Salehi ldquoApplication of artificialneural networks in cancer classification and diagnosis pre-diction of a subtype of lymphoma based on gene expressionprofilerdquo Journal of Research in Medical Sciences vol 11 no 1 pp13ndash17 2006

[13] Z Wang Y Wang J Xuan et al ldquoOptimized multilayerperceptrons for molecular classification and diagnosis usinggenomic datardquo Bioinformatics vol 22 no 6 pp 755ndash761 2006

[14] K V G Rao P P Chand and M V R Murthy ldquoA neuralnetwork approach in medical decision systemsrdquo Journal ofTheoretical and Applied Information Technology vol 3 pp 97ndash101 2007

[15] H Rattikorn and K Phongphun ldquoTumor classification rankingfrom microarray datardquo BMC Genomics vol 9 no 2 article S212008

[16] L-T Huang ldquoAn integrated method for cancer classificationand rule extraction frommicroarray datardquo Journal of BiomedicalScience vol 16 article 25 10 pages 2009

[17] T Y Yang ldquoEfficient multi-class cancer diagnosis algorithmusing a global similarity patternrdquo Computational Statistics ampData Analysis vol 53 no 3 pp 756ndash765 2009

[18] H Pang K Ebisu E Watanabe L Y Sue and T TongldquoAnalysing breast cancer microarrays from African Amer-icans using shrinkage-based discriminant analysisrdquo HumanGenomics vol 5 no 1 pp 5ndash16 2010

[19] N B Dawany W N Dampier and A Tozeren ldquoLarge-scaleintegration of microarray data reveals genes and pathwayscommon to multiple cancer typesrdquo International Journal ofCancer vol 128 no 12 pp 2881ndash2891 2011

[20] R Pillai RDeeter C T Rigl et al ldquoValidation and reproducibil-ity of a microarray-based gene expression test for tumor identi-fication in formalin-fixed paraffin-embedded specimensrdquo TheJournal of Molecular Diagnostics vol 13 no 1 pp 48ndash56 2011

[21] D Ghosh ldquoPenalized discriminant methods for the classifica-tion of tumors from gene expression datardquo Biometrics vol 59no 4 pp 992ndash1000 2003

[22] J W Lee J B Lee M Park and S H Song ldquoAn extensivecomparison of recent classification tools applied to microarraydatardquo Computational Statistics amp Data Analysis vol 48 no 4pp 869ndash885 2005

[23] M S Srivastava and T Kubokawa ldquoComparison of discrimina-tion methods for high dimensional datardquo Journal of the JapanStatistical Society vol 37 no 1 pp 123ndash134 2007

[24] R Bermudo D Abia A Mozos et al ldquoHighly sensitivemolecular diagnosis of prostate cancer using surplus materialwashed off from biopsy needlesrdquo British Journal of Cancer vol105 no 10 pp 1600ndash1607 2011

[25] W Li F Sun and I Grosse ldquoExtreme value distribution basedgene selection criteria for discriminantmicroarray data analysisusing logistic regressionrdquo Journal of Computational Biology vol11 no 2-3 pp 215ndash226 2004

[26] J G Liao and K V Chin ldquoLogistic regression for diseaseclassification using microarray data model selection in a largep and small n caserdquo Bioinformatics vol 23 no 15 pp 1945ndash19512007

8 Discrete Dynamics in Nature and Society

[27] S. Gruvberger, M. Ringner, Y. Chen et al., "Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns," Cancer Research, vol. 61, no. 16, pp. 5979–5984, 2001.

[28] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.

[29] L. J. Lancashire, C. Lemetre, and G. R. Ball, "An introduction to artificial neural networks in bioinformatics—application to complex microarray and mass spectrometry datasets in cancer studies," Briefings in Bioinformatics, vol. 10, no. 3, pp. 315–329, 2009.

[30] M. R. Segal, K. D. Dahlquist, and B. R. Conklin, "Regression approaches for microarray data analysis," Journal of Computational Biology, vol. 10, no. 6, pp. 961–980, 2003.

[31] A. Dragomir and A. Bezerianos, "Improving gene expression sample classification using support vector machine ensembles aggregated by boosting," Cancer Genomics & Proteomics, vol. 3, no. 1, pp. 63–70, 2006.

[32] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, "Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485–494, 2007.

[33] X. Wang and R. Simon, "Microarray-based cancer prediction using single genes," BMC Bioinformatics, vol. 12, article 391, 2011.

[34] K. Y. Chan, C. K. Kwong, and Y. C. Tsim, "Modelling and optimization of fluid dispensing for electronic packaging using neural fuzzy networks and genetic algorithms," Engineering Applications of Artificial Intelligence, vol. 23, no. 1, pp. 18–26, 2010.

[35] K. Y. Chan, T. S. Dillon, and C. K. Kwong, "Modeling of a liquid epoxy molding process using a particle swarm optimization-based fuzzy regression approach," IEEE Transactions on Industrial Informatics, vol. 7, no. 1, pp. 148–158, 2011.

[36] Y. E. Shao and C.-D. Hou, "Change point determination for a multivariate process using a two-stage hybrid scheme," Applied Soft Computing, vol. 13, no. 3, pp. 1520–1527, 2013.

[37] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, 1986.

[38] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.

[39] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[40] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[41] S. L. Lin, C. D. Hou, and P. H. Gi, "Do the two-stage hybrid models outperform the conventional techniques? Evidence in Taiwan," International Journal of Business and Strategy, vol. 9, pp. 98–131, 2008.

[42] S. L. Lin, "A new two-stage hybrid approach of credit risk in banking industry," Expert Systems with Applications, vol. 36, no. 4, pp. 8333–8341, 2009.

[43] S.-M. Chou, T.-S. Lee, Y. E. Shao, and I.-F. Chen, "Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines," Expert Systems with Applications, vol. 27, no. 1, pp. 133–142, 2004.

[44] R. Modarres and J. L. Gastwirth, "Hybrid test for the hypothesis of symmetry," Journal of Applied Statistics, vol. 25, no. 6, pp. 777–783, 1998.

[45] R. Tang, M. Banerjee, and G. Michailidis, "A two-stage hybrid procedure for estimating an inverse regression function," The Annals of Statistics, vol. 39, no. 2, pp. 956–989, 2011.

[46] W. Bischoff and F. Miller, "A minimax two-stage procedure for comparing treatments looking at a hybrid test and estimation problem as a whole," Statistica Sinica, vol. 12, no. 4, pp. 1133–1144, 2002.

[47] C.-J. Lu, Y. E. Shao, and P.-H. Li, "Mixture control chart patterns recognition using independent component analysis and support vector machine," Neurocomputing, vol. 74, no. 11, pp. 1908–1914, 2011.

[48] Y. E. Shao and C. D. Hou, "Fault identification in industrial processes using an integrated approach of neural network and analysis of variance," Mathematical Problems in Engineering, vol. 2013, Article ID 516760, 7 pages, 2013.

[49] Z. W. Zhong, L. P. Khoo, and S. T. Han, "Prediction of surface roughness of turned surfaces using neural networks," International Journal of Advanced Manufacturing Technology, vol. 28, no. 7-8, pp. 688–693, 2006.

[50] Y. E. Shao, C.-D. Hou, and C.-C. Chiu, "Hybrid intelligent modeling schemes for heart disease classification," Applied Soft Computing Journal, vol. 14, pp. 47–52, 2014.



