
2010 2nd International Conference on Computer Technology and Development (lCCTD 2010)

A Gene Selection Approach for Classifying Diseases Based on Microarray Datasets

Taysir Hassan A. Soliman, Associate Professor, Information Systems Dept, [email protected]
Adel A. Sewissy, Professor, Computer Science Dept, [email protected]
Hisham AbdelLatif, Researcher, Information Systems Dept, [email protected]

Faculty of Computer and Information, Assiut University, Assiut, Egypt

Abstract- Gene selection is a very important problem in the classification of serious diseases in clinical information systems. A limitation of existing gene selection methods is that they may produce gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analysis. In the current work, a hybrid approach is presented to classify diseases, such as colon cancer, leukemia, and liver cancer, based on informative genes. This hybrid approach uses clustering (K-means) with statistical analysis (ANOVA) as a preprocessing step for gene selection, and Support Vector Machines (SVM) to classify diseases from microarray experiments. To evaluate the performance of the proposed methodology, two kinds of comparisons were performed: 1) applying statistical analysis with and without the clustering algorithm (K-means) as a preprocessing step, and 2) comparing different classification algorithms: decision tree (ID3), naive Bayes, adaptive naive Bayes, and support vector machines. Combining clustering with statistical analysis in the preprocessing phase gives a much better classification accuracy, reaching 97%, than statistical analysis alone. In addition, SVM proved more accurate than decision tree, Naive Bayes, and Adaptive Naive Bayes classification.

Keywords: Gene Selection, Feature Selection, Clustering, ANOVA test, Classification, Microarray data.

I. INTRODUCTION

Even though the human genome sequencing project is almost finished, the analysis has just begun. Besides sequence information, microarrays are constantly delivering large amounts of data about the inner life of a cell. The new challenge is to evaluate these gigantic data streams and extract useful information. Microarrays simultaneously measure the mRNA expression levels of thousands of genes in a cell mixture. By comparing the expression profiles of different tissue types, we might find the genes that best explain a perturbation or might even help clarify how cancer develops, so that doctors can select proper treatment. Therefore, microarray data are used to analyze and monitor gene expression related to various diseases, for tasks such as cancer classification.

Recently, disease classification based on microarray gene expression experiments has attracted many researchers. However, classification based on gene expression data is not a simple task because of the characteristics of the data: the high dimensionality of genes and the small sample size. This makes conventional machine learning tools unsuitable on their own, so dimensionality reduction must be performed as a preprocessing step before classifying diseases. To overcome the high dimensionality problem, the genes that are truly related to the disease must be extracted.

978-1-4244-8845-2/10/$26.00 © 2010 IEEE

The significance of finding the minimum gene subset is three-fold: 1) It greatly reduces the computational burden and the noise arising from irrelevant genes. 2) It simplifies gene expression tests to include only a very small number of genes rather than thousands, which can bring down the cost of cancer testing significantly. 3) It calls for further investigation into the possible biological relationship between this small number of genes and cancer development and treatment.

Typically, informative genes are selected according to a statistical test, such as the t-test, ANOVA, or another test suitable for the problem at hand. Our approach focuses on first finding similar genes by grouping them (via clustering) and then selecting informative genes from these groups through the ANOVA statistical test, to avoid redundancy.

In the current paper, an efficient classification methodology is developed based on clustering correlated genes into groups to find similar genes and applying the ANOVA test to determine informative genes. Section two clarifies related work; section three illustrates our proposed methodology. Section four discusses our results, applying our methodology to three different datasets of leukemia, colon cancer, and liver cancer. Finally, conclusions and future work are presented.

II. RELATED WORK

Researchers have used a number of approaches to select informative genes in order to classify diseases from microarray data, reduce data redundancy, and improve and evaluate classification accuracy. Some work depends on the traditional t-test statistic [1, 2, 3] and the analysis of variance (ANOVA) F-test statistic [4, 5]. While the t-test is used for two-class prediction problems, the F-test is used for multiclass problems. The test statistics t and F are not only used in class prediction; they also apply to class discovery [6, 7]. The main goal of class discovery is to identify subtypes of diseases. The major difference between class prediction and class discovery is that the former uses labeled samples while the latter uses unlabeled samples.
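As a concrete illustration of the two statistics (not from the original paper; the data here are synthetic), the per-gene t-test for a two-class problem and the F-test for a multiclass problem can both be computed with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic expression values for one gene across 12 samples in 3 classes.
expr = rng.normal(size=12)
labels = np.array([0] * 4 + [1] * 4 + [2] * 4)

# Two-class prediction: t statistic between the first two classes.
t_stat, t_p = stats.ttest_ind(expr[labels == 0], expr[labels == 1])

# Multiclass prediction: one-way ANOVA F statistic across all three classes.
f_stat, f_p = stats.f_oneway(*(expr[labels == c] for c in range(3)))
```

Ranking genes by either statistic (or by its p-value) then gives a per-gene informativeness score.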


Kumar and Punithavalli [8] derived a method for evaluating and improving techniques for selecting informative genes from microarray data by ranking genes according to the ANOVA test statistic; they also used a Fuzzy Neural Network as a classifier for the informative genes. Jaeger et al. [9] developed a method for evaluating and improving techniques for selecting informative genes from microarray data, where genes of interest are typically selected by ranking genes according to a test statistic and then choosing the top k genes. Yang et al. [10] described an improved hybrid system for gene selection based on a recently proposed Genetic Ensemble (GE) system. Cui et al. [11] proposed two methods based on the statistical ranking and selection framework to directly address the selection goal. However, combining both clustering and ANOVA as a preprocessing step gives more accurate results for selecting informative genes. Golub [12] developed a generic approach to cancer classification based on gene expression monitoring by DNA microarrays and applied it to human acute leukemia as a test case. In the next section, the proposed classification methodology is illustrated in more detail.

III. PROPOSED CLASSIFICATION METHODOLOGY

Our proposed method consists of two main phases, preprocessing and classification, as illustrated in Figure 1. In the first phase, preprocessing steps are performed to find informative genes from microarray experiments; these steps consist of applying the K-means algorithm to cluster genes and the ANOVA test to find informative genes within the clustered genes. In the second phase, the support vector machine algorithm is used to classify samples based on those informative genes.
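The two phases can be sketched with scikit-learn. This is a hedged illustration on synthetic data, not the authors' Oracle Miner implementation; the cluster count, kernel, and data shapes are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 500))     # 40 samples x 500 genes (synthetic)
y = rng.integers(0, 2, size=40)    # binary disease labels

# Phase 1a: cluster the genes (columns) by their expression profiles.
k = 20
gene_clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X.T)

# Phase 1b: within each cluster keep the gene with the largest ANOVA F score,
# which avoids selecting many redundant genes from the same cluster.
F, _ = f_classif(X, y)
selected = [int(np.flatnonzero(gene_clusters == c)[np.argmax(F[gene_clusters == c])])
            for c in range(k)]

# Phase 2: train an SVM on the selected informative genes only.
clf = SVC(kernel="linear").fit(X[:, selected], y)
```

Keeping one representative gene per cluster is what bounds the redundancy of the final gene set.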

A. MICROARRAY EXPERIMENTS

Microarrays are sets of miniaturized chemical reaction areas that may be used to test DNA fragments, antibodies, or proteins, by using a chip with immobilized targets and hybridizing them with a probed sample. The color obtained from the chip after hybridization is scanned, and the data are analyzed by software to find the expression level. A typical microarray holds spots representing several thousand to several tens of thousands of genes or ESTs (expressed sequence tags). After hybridization, the microarray is scanned and converted into numerical data. Finally, the data should be normalized [13]. The purpose of this step is to counter systematic variation (e.g., differences in labeling efficiency for different dyes, or compensation for signal spill-over from neighboring spots [14]) and to allow comparison between different microarrays [15].
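A minimal normalization sketch (my own illustration; real pipelines such as those referenced above also handle dye bias and spot-level corrections) is a log transform followed by per-array median centering:

```python
import numpy as np

rng = np.random.default_rng(2)
# Raw scanned intensities: rows = genes, columns = arrays (samples).
raw = rng.lognormal(mean=8.0, sigma=1.0, size=(1000, 6))

# Log-transform, then subtract each array's median so that arrays with
# different overall brightness become comparable.
logged = np.log2(raw)
normalized = logged - np.median(logged, axis=0, keepdims=True)
```

After this step every array has median log-intensity zero, so expression values can be compared across microarrays.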

As a consequence, one has to identify a small subset of informative genes from microarray experiments that contributes most to the classification task. Performing feature selection is essential for microarray prediction problems, since high-dimensional problems usually involve higher computational complexity and bigger prediction errors.


Assume there are k (≥ 2) distinct tumor tissue classes for the problem under consideration, with p genes (inputs) and n tumor mRNA samples (observations).

Figure 1. Proposed Classification Methodology (Finding Selective Genes Phase: clustering and statistical testing; Classification Phase)

Suppose $x_{gs}$ is the measurement of the expression level of gene $g$ from sample $s$, for $g = 1, \ldots, p$ and $s = 1, \ldots, n$. In terms of an expression matrix $G$, we may write:

$$G = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pn} \end{pmatrix} \qquad (1)$$

where the columns and rows of the expression matrix $G$ correspond to samples and genes, respectively. Note that $G$ is a matrix of data highly processed through preprocessing techniques that include image analysis, normalization, and often logarithmic transformation. We assume that the data $G$ are standardized so that each gene has mean 0 and variance 1 across samples. Given a fixed gene, let $Y_{ij}$ be the expression level from the $j$th sample of the $i$th class. Note that these $Y_{ij}$ come from the corresponding row of $G$; for example, for gene 1, the $Y_{ij}$ are a rearrangement of the first row of $G$. We consider the following general model for $Y_{ij}$:

$$Y_{ij} = \mu_i + \varepsilon_{ij}, \quad j = 1, \ldots, n_i, \quad i = 1, \ldots, k, \qquad (2)$$

with $n_1 + n_2 + \cdots + n_k = n$. In the model, $\mu_i$ is a parameter representing the mean expression level of the gene in class $i$, and the $\varepsilon_{ij}$ are error terms that are independent normal random variables.
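The standardization assumed for G, each gene having mean 0 and variance 1 across samples, is a one-liner in NumPy (synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(loc=5.0, scale=2.0, size=(200, 30))   # rows = genes, cols = samples

# Standardize every gene (row) to mean 0, variance 1 across its n samples.
G_std = (G - G.mean(axis=1, keepdims=True)) / G.std(axis=1, keepdims=True)
```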

B. Phase One: Finding Selective Genes

This phase, the preprocessing phase, consists of two stages: clustering and gene selection. In stage one, clustering is applied to microarray gene expression experiments, which can be obtained from any microarray gene expression database. Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid.

1. k-means clustering algorithm

The K-means algorithm, probably the simplest of the proposed clustering algorithms, is based on a very simple idea: given a set of initial clusters, assign each point to one of them, then replace each cluster center with the mean point of the respective cluster. These two steps are repeated until convergence. A point is assigned to the cluster whose center is closest in Euclidean distance. The procedure is a simple and easy way to classify a given dataset through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point of the dataset and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding is done between the same dataset points and the nearest new centroid, generating a loop. As a result of this loop, the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared-error function, which is as follows:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\lVert x_i^{(j)} - c_j \right\rVert^2 \qquad (3)$$

where $\lVert x_i^{(j)} - c_j \rVert^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, so that $J$ is an indicator of the distance of the $n$ data points from their respective cluster centres. The k-means algorithm randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges.
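The assign/update loop described above can be written directly in NumPy. This is a bare-bones sketch with a simple empty-cluster guard; production code would use a library implementation:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Assign points to the nearest centroid, recompute means, repeat to convergence."""
    rng = np.random.default_rng(seed)
    # Randomly select k of the objects as initial cluster centers.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid in Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the barycenter of its points
        # (kept in place if its cluster happens to be empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move: converged
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, the loop recovers the grouping regardless of which objects are drawn as initial centers.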


2. ANOVA Test

In the second step of the preprocessing phase, to select informative genes from the clustered groups, many different techniques are applicable, such as the t-test, ANOVA test, Wilcoxon rank test, and others. A test statistic is a numerical summary of a set of data that reduces the data to one value (or a small number of values) that can be used to perform a hypothesis test. A test statistic shares some of the qualities of a descriptive statistic, and many statistics can be used as both test statistics and descriptive statistics. However, a test statistic is specifically intended for use in statistical testing, whereas the main quality of a descriptive statistic is that it is easily interpretable. Some informative descriptive statistics, such as the sample range, do not make good test statistics, since it is difficult to determine their sampling distribution.

In our case, we used the traditional ANOVA F test to find selective genes from clusters, which is computed as follows:

$$F = \frac{\sum_{i} n_i \left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right)^2 / (k-1)}{\sum_{i} (n_i - 1)\, s_i^2 / (n-k)} \qquad (4)$$

where $\bar{Y}_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij} / n_i$, $\bar{Y}_{\cdot\cdot} = \sum_{i} n_i \bar{Y}_{i\cdot} / n$, and $s_i^2 = \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2 / (n_i - 1)$. For simplicity, $\sum_i$ indicates that the sum is taken over the index $i$. Under $H_0$ and assuming variance homogeneity, this well-known test statistic has an $F_{k-1,\,n-k}$ distribution [13].
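Formula (4) maps directly to code; here the F statistic for one gene is computed term by term on synthetic values and cross-checked against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Expression levels of one gene, grouped by class (k = 3 classes).
groups = [np.array([5.1, 4.8, 5.3]),
          np.array([6.0, 6.2, 5.9]),
          np.array([4.0, 4.1, 3.8])]
k = len(groups)
n = sum(len(g) for g in groups)

grand_mean = np.concatenate(groups).mean()                              # Y-bar..
between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
F = between / within   # compare against the F_{k-1, n-k} distribution

# Sanity check: matches SciPy's implementation of the same statistic.
assert np.isclose(F, stats.f_oneway(*groups).statistic)
```

Genes whose F exceeds the chosen critical value (or whose p-value falls below the threshold) are kept as selective genes.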

C. Phase Two: Building the SVM Classifier

Classifying diseases based on the selected informative genes is the main goal of our second phase, and the key question is how to choose the best classifier. We use Support Vector Machines (SVM), a promising method for the classification of both linear and nonlinear data, which works by using a nonlinear mapping to transform the original training data into a higher dimension. SVM is a typical example of a powerful data mining algorithm that can produce results of very good quality and can address problems not amenable to traditional statistical analysis. SVM includes, as special cases, a large class of neural networks, radial basis function networks, and polynomial classifiers. In fact, SVM has emerged as the algorithm of choice for modeling challenging high-dimensional data where other techniques under-perform. Example domains include text mining [16], image mining [17], and bioinformatics [18].

Within this new dimension, SVM searches for the linear optimal separating hyperplane (that is, a "decision boundary" separating the tuples of one class from another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors). In addition, the SVM algorithm is capable of constructing models that are complex enough to solve difficult real-world problems. At the same time, SVM models have a simple functional form and are amenable to theoretical analysis.

In the current paper, a support vector machine is applied as a classifier to decide whether a patient has a disease or not. A number of software packages, both open source and commercial, provide access to support vector machines; examples are WEKA (open source) and the Oracle Miner suite (starting from the 10g version). We applied the Oracle Miner suite and compared the results of support vector machines for classifying diseases with a decision tree classifier (using the C4.5 algorithm) and naive Bayes, where support vector machines gave the best results.
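As an open-source alternative to the Oracle Miner workflow described above, the same kind of SVM classifier can be fit with scikit-learn. This is a hedged sketch on synthetic, non-linearly-separable data; the RBF kernel and C value are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Two-class toy data that is not linearly separable in the original space:
# class 1 lies outside the unit circle, class 0 inside it.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# The RBF kernel implicitly maps the data to a higher dimension where a
# separating hyperplane exists; support_vectors_ are the "essential" tuples.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
```

The fitted model stores only the support vectors, which is what keeps the functional form simple despite the nonlinear boundary.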

IV. RESULTS AND DISCUSSION

Three benchmark binary-class microarray datasets were used for system evaluation, as illustrated in Table 1. The "Leukemia" dataset [15] investigates the expression of two different subtypes of leukemia (27 ALL and 11 AML); the "Colon" dataset [16] contains expression patterns of 22 normal and 40 cancerous tissues. Finally, the "Liver" dataset [17] has 82 samples labeled as hepatocellular carcinoma (HCC) and 75 samples labeled as non-tumor. The Oracle Miner suite (10g) was used, specifically decision tree (C4.5), the naive Bayes classifier, and SVM. The task for these three datasets is to identify a small group of genes which can distinguish samples of the two classes.

Table 1: Microarray Datasets

Name     Leukemia   Colon   Liver
Sample   38         62      44
Gene     7129       2000    300
Class    2          2       2

To find selective genes, two main steps are performed according to our proposed methodology: preprocessing and classification. For preprocessing, the K-means algorithm is first applied to each dataset to group similar genes, and ANOVA testing was chosen for finding selective genes within the clusters. Then, for the selective genes within each of the Leukemia, Colon, and Liver cancer datasets, the SVM classifier was built with 60% of the data taken as the training set and 40% as the testing set.
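The 60/40 split used above can be reproduced with scikit-learn's train_test_split (synthetic stand-in data; the sample and gene counts are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(62, 30))       # e.g. 62 samples x 30 selected genes
y = rng.integers(0, 2, size=62)

# 60% of the samples train the classifier; the remaining 40% test it.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```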

Applying the three-group ANOVA test to the Leukemia dataset with p-value = 0.01, using MultiArray Viewer, we obtained 713 significant genes out of 7129; the rest are non-significant. A sample of the selective informative genes is shown in Table 2.

Applying various classifiers to the Leukemia selective informative genes, the highest accuracy we obtained was with SVM, at 97%. The other classifiers we compared with were decision trees (C4.5: 79%), naive Bayes (79%), and adaptive naive Bayes (84%).


Receiver Operating Characteristic (ROC) analysis was used for evaluating the classifiers; ROC provides a means to compare individual models and determine thresholds which yield a high proportion of positive hits. Figure 2 illustrates the ROC curves for the leukemia data classification.

Table 2: A Sample of Selective Informative Genes

AFFX-BioC-5_st              -653    -577    -490
AFFX-HSAC07/X00351_5_at     16287   15770   16386
AFFX-HSAC07/X00351_M_at     18727   18773   19091
AFFX-HSAC07/X00351_5_st     -1      -91     -220

The horizontal axis of an ROC graph measures the false positive rate as a percentage, and the vertical axis shows the true positive rate. The top left-hand corner is the optimal location on an ROC curve, indicating a high true-positive rate versus a low false-positive rate. The area under the ROC curve measures the discriminating ability of a binary classification model: the larger the area under the curve, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. For example, the area under the curve using SVM is greater than for the other three classifiers, as shown in Figure 2. We repeated the same proposed methodology to confirm our results for colon cancer and liver cancer as well, where the numbers of selective genes are 725 and 625, respectively.
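The ROC comparison described above can be reproduced for any classifier that produces a continuous score; a sketch with scikit-learn on synthetic data, using a linear SVM's decision function as the score:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# decision_function gives a continuous score; sweeping a threshold over it
# traces out the (false positive rate, true positive rate) pairs of the curve.
scores = SVC(kernel="linear").fit(X, y).decision_function(X)
fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)   # 0.5 = random ranking, 1.0 = perfect
```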

In addition, CPU time performance was compared between SVM and the Decision Tree, Naive Bayes, and Adaptive Naive Bayes algorithms, as shown in Table 3. For the Leukemia dataset, one can see that the Naive Bayes classifier took the least time to build and to test, 13 seconds and 14 seconds, respectively. On the other hand, the Adaptive Bayes Network took the most time to build, 1:18, and failed to build on the colon cancer data, which is due to the data itself. Taking the three datasets to measure CPU time, SVM took about 2 seconds more than Decision Tree and Naive Bayes; however, its accuracy was much higher than both algorithms.

V. CONCLUSION AND FUTURE WORK

In this paper, a new methodology was proposed to classify diseases from microarray experiments through a hybrid approach of K-means clustering and ANOVA testing. In addition, the Support Vector Machine classification algorithm was applied as one of the successful classification algorithms. Three datasets were used: Leukemia, colon cancer, and liver cancer. On average, we obtained the highest accuracy using SVM, 97%, compared to three other classifiers: the decision tree (C4.5), naive Bayes, and adaptive naive Bayes algorithms. In addition, CPU time was measured for each algorithm for both building and testing the classifier. We plan to use this classifier as part of a biomedical informatics system within a hospital to help doctors diagnose various diseases.


REFERENCES

[1] M. Xiong, L. Jin, W. Li, and E. Boerwinkle, "Computational methods for gene expression-based tumor classification," Biotechniques, 29(6):1264-1270, 2000.

[2] D. V. Nguyen and D. M. Rocke, "Tumor classification by partial least squares using microarray gene expression data," Bioinformatics, 18(1):39-50, 2002.

[3] H. Liu, J. Li, and L. Wong, "A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns," Genome Informatics, 13:51-60, 2002.

[4] D. Ghosh, "Singular value decomposition regression models for classification of tumors from microarray experiments," Proceedings of the 2002 Pacific Symposium on Biocomputing, Lihue, Hawaii, 2002, pp. 18-29.

[5] D. V. Nguyen and D. M. Rocke, "Multi-class cancer classification via partial least squares with gene expression profiles," Bioinformatics, 18(9):1216-1226, 2002.

[6] C. Ding, "Analysis of gene expression profiles: class discovery and leaf ordering," Proceedings of the 6th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2002), Washington, DC, 2002, pp. 127-136.

[7] W. Li, M. Fan, and M. Xiong, "SamCluster: An integrated scheme for automatic discovery of sample classes using gene expression profile," Bioinformatics, 19(7):811-817, 2003.

[8] K. A. Kumar and M. Punithavalli, "Minimal Feature Selection Using Statistical Techniques," Advances in Computational Sciences and Technology, Vol. 3, No. 1, pp. 33-38, 2010.

[9] J. Jaeger, R. Sengupta, and W. L. Ruzzo, "Improved gene selection for classification of microarrays," Pacific Symposium on Biocomputing, 8:53-64, 2003.

[10] Y. H. Yang, M. J. Buckley, S. Dudoit, and T. P. Speed, "Comparison of methods for image analysis on cDNA microarray data," Technical report, 2000.

[11] Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed, "Normalization for cDNA Microarray Data," SPIE BiOS, 2001.

[12] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Series in Data Management Systems, 2006.

[13] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, 23(19):2507-2517, 2007.

[14] R. L. Somorjai, B. Dolenko, R. Baumgartner, J. E. Crow, and J. H. Moore, "Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions," Bioinformatics, 19:1484-1491, 2003.

[15] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, and J. Mesirov, "Molecular classification of cancer: class discovery and class prediction by gene expression," Science, vol. 286, pp. 531-537, 1999.

[16] U. Alon et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," PNAS, 96:6745-6750, 1999.

[17] X. Chen et al., "Gene expression patterns in human liver cancers," Molecular Biology of the Cell, 13:1929-1939, 2002.

[18] J. Hua et al., "Optimal number of features as a function of sample size for various classification rules," Bioinformatics, 21:1509-1515, 2005.

[19] W. Li and Y. Yang, "How many genes are needed for a discriminant microarray data analysis?," Proceedings of Critical Assessment of Microarray Data Analysis, pp. 137-150, 2000.

[20] X. Cui, H. Zhao, and J. Wilson, "Optimized Ranking and Selection Methods for Feature Selection with Application in Microarray Experiments," Journal of Biopharmaceutical Statistics, Vol. 20, No. 2, pp. 223-239, March 2010.

" a.al� Oos'son(roo •. a .a� Na'vobayos

� D.eo _ Th ........ ol'" lI! � 0....

-0 ..... .,. ..... , .

g! D.e - T ......... .,.,'"

If 0.:2

-ROC C .. ""... � 0.04 = :;;:;�o,;::....

0.0 0.:2

0.0 0.2 0.... D.e O.B 1 .0 " : : �F_'._ Po."' __ a__ Adaptlvo bayosan�:ork

.

a:aFb?=- , .:.:o." , _a�OR_- a.a

supporl vector machine

i OA = ���:;:�. � -

T ......... "" '"

1: &:: _0 ..... .,. ...... , 0.:2

� 0....

-ROC C ......... ...

0.0 0.:2 0 ....

F_, __ Po.""".'. R__ 0.0 0.2 0.... o.co o.a

F_' __ Po_, ........... "" __

Figure 2. (a) ROC for Decision tree, (b) ROC for naIve Bayes, (c) ROC for adaptive Bayes network and (d) ROC for SVM

Table 3: Comparison of CPU Time between SVM, DT, NB, and ABN

Dataset        Training           Testing            SVM Build/Test   DT Build/Test   NB Build/Test   ABN Build/Test
Leukemia       18 samples (60%)   12 samples (40%)   17s / 18s        15s / 16s       13s / 14s       1:18 / 18s
Colon Cancer   18 samples (60%)   12 samples (40%)   15s / 16s        15s / 16s       15s / 16s       Failed / Failed
Liver Cancer   26 samples (60%)   17 samples (40%)   14s / 16s        14s / 14s       14s / 16s       1:13 / 20s
