An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai


Page 1

An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification

by Carlotta Domeniconi and Hong Chai

Page 2

Outline

• Introduction to microarray data

• Problem description

• Related work

• Our methods

• Experimental Analysis

• Results

• Conclusion and future work

Page 3

Microarray

• Measures gene expression levels across different conditions, times or tissue samples

• Gene expression levels inform cell activity and disease status

• Microarray data distinguish between tumor types, define new subtypes, predict prognostic outcome, identify possible drugs, assess drug toxicity, etc.

Page 4

Microarray Data

• A matrix of measurements: rows are gene expression levels; columns are samples/conditions.

Page 5

Example – Lymphoma Dataset

Page 6

Microarray data analysis

• Clustering is applied to genes, to identify genes with similar functions or genes that participate in similar biological processes, or to samples, to find potential tumor subclasses.

• Classification builds a model to predict the class of diseased samples, which has diagnostic value.

Page 7

Classification Problem

• Large number of genes (features) - may contain up to 20,000 features.

• Small number of experiments (samples) – at most a few hundred, usually fewer than 100 samples.

• The need to identify “marker genes” that classify tissue types (e.g. to diagnose cancer) motivates feature selection.

Page 8

Our Focus

• Binary classification and feature selection methods have been studied extensively; the multi-class case has received little attention.

• In practice, many microarray datasets have more than two categories of samples.

• We focus on multi-class gene ranking and selection.

Page 9

Related Work

Some criteria used in feature ranking:

• Correlation coefficient

• Information gain

• Chi-squared

• SVM-RFE

Page 10

Notation

• Given C classes

• m observations (samples or patients)

• n feature measurements (gene expressions)

• class labels y= 1,...,C

• each sample is a vector $\mathbf{x} = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$

Page 11

Correlation Coefficient

• Two-class problem: y ∈ {-1, +1}
• Ranking criterion defined in Golub:

$w_j = \frac{\mu_j^+ - \mu_j^-}{\sigma_j^+ + \sigma_j^-}$

• where $\mu_j^\pm$ is the mean and $\sigma_j^\pm$ the standard deviation along dimension j in the + and - classes; a large $|w_j|$ indicates a discriminant feature

Page 12

Fisher’s score

• Fisher’s criterion score in Pavlidis:

$w_j = \frac{(\mu_j^+ - \mu_j^-)^2}{(\sigma_j^+)^2 + (\sigma_j^-)^2}$
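As an illustrative sketch (not the authors' code), the two two-class ranking scores above can be written in a few lines of NumPy; the function names and the toy matrix shapes are hypothetical:

```python
import numpy as np

def golub_score(X, y):
    """Golub score w_j = (mu_j+ - mu_j-) / (sigma_j+ + sigma_j-) per gene.
    X: (samples, genes) matrix; y: labels in {-1, +1}.
    Genes are ranked by |w_j|."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0))

def fisher_score(X, y):
    """Fisher criterion w_j = (mu_j+ - mu_j-)^2 / ((sigma_j+)^2 + (sigma_j-)^2)."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(0) - neg.mean(0)) ** 2 / (pos.std(0) ** 2 + neg.std(0) ** 2)
```

Both scores look at each gene in isolation, which is exactly the independence assumption discussed on the next slide.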

Page 13

Assumption of above methods

• Features are analyzed in isolation; correlations between genes are not considered.

• Assumption: features are independent of each other.

• Implication: redundant genes may be selected into the top subset.

Page 14

Information Gain

• A measure of the effectiveness of a feature in classifying the training data.

• Expected reduction in entropy caused by partitioning the data according to this feature.

• V (A) is the set of all possible values of feature A, and Sv is the subset of S for which feature A has value v

$I(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v)$

Page 15

Information Gain

• E(S) is the entropy of the entire set S.

• where $|C_i|$ is the number of training data in class $C_i$, and $|S|$ is the cardinality of the entire set S.

$E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$
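A minimal sketch of entropy and information gain for a feature that has already been discretized; names and data are illustrative, not the authors' implementation:

```python
import numpy as np

def entropy(labels):
    """E(S) = -sum_i (|C_i|/|S|) * log2(|C_i|/|S|)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(feature_values, labels):
    """I(S, A) = E(S) - sum over values v of (|S_v|/|S|) * E(S_v),
    for a feature whose values have already been discretized."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain
```

A feature that splits the classes perfectly reduces the entropy to zero, so its gain equals E(S); an uninformative feature has gain near zero.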

Page 16

Chi-squared

• Measures features individually
• Continuous-valued features are discretized into intervals
• Form a matrix A, where $A_{ij}$ is the number of samples of class $C_i$ within the j-th interval
• Let $CI_j$ be the number of samples in the j-th interval

Page 17

Chi-squared

• The expected frequency of $A_{ij}$ is

$E_{ij} = \frac{|C_i| \cdot CI_j}{m}$

• The Chi-squared statistic of a feature is defined as

$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$

• where I is the number of intervals. The larger the statistic, the more informative the feature.
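Assuming the C x I contingency matrix A has already been built by discretization, the statistic above can be sketched as follows (an illustration, with hypothetical names):

```python
import numpy as np

def chi_squared(A):
    """Chi-squared statistic of one feature from its C x I contingency
    matrix A, where A[i, j] counts class-i samples in interval j."""
    m = A.sum()                      # total number of samples
    CI = A.sum(axis=0)               # samples per interval, CI_j
    C = A.sum(axis=1)                # samples per class, |C_i|
    E = np.outer(C, CI) / m          # expected frequencies E_ij = |C_i| * CI_j / m
    return float(((A - E) ** 2 / E).sum())
```

A perfectly class-separating discretization maximizes the statistic, while a table matching the expected frequencies exactly scores zero.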

Page 18

SVM-RFE

• Recursive Feature Elimination using SVM
• Train a linear SVM on the full feature set: $\mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$, where $\mathbf{w}$ is a vector of weights (one per feature), $\mathbf{x}$ is an input instance, and b a threshold.
• If $w_i = 0$, feature $X_i$ does not influence classification and can be eliminated from the set of features.

Page 19

SVM-RFE

1. Train a linear SVM on the current feature set to obtain $\mathbf{w}$, and sort the features in descending order of weight.
2. Eliminate a fixed percentage of the lowest-ranked features.
3. Build a new linear SVM using the remaining features and repeat the process.
4. Choose the best feature subset.
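A hedged sketch of the elimination loop, using scikit-learn's LinearSVC as a stand-in for the paper's SVM implementation (an assumption) and the binary case for simplicity; `svm_rfe` and `drop_frac` are illustrative names:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep, drop_frac=0.5):
    """SVM-RFE sketch: repeatedly train a linear SVM, rank features by
    |w_i|, and eliminate a fraction of the lowest-ranked ones until
    n_keep features remain. Returns surviving column indices of X."""
    idx = np.arange(X.shape[1])
    while len(idx) > n_keep:
        w = LinearSVC(dual=False).fit(X[:, idx], y).coef_.ravel()
        order = np.argsort(np.abs(w))                    # ascending |w_i|
        n_drop = min(max(1, int(len(idx) * drop_frac)),  # at least one feature,
                     len(idx) - n_keep)                  # never below n_keep
        idx = idx[np.sort(order[n_drop:])]               # drop the lightest
    return idx
```

Unlike the per-gene scores, each round re-weights the surviving features jointly, which is what lets RFE account for feature interactions.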

Page 20

Other criteria

• The Brown-Forsythe, the Cochran, and the Welch test statistics used in Chen, et al.

(Extensions of the t-statistic used in the two-class classification problem.)

• PCA (disadvantage: new dimensions are formed, and none of the original features can be discarded, so marker genes cannot be identified)

Page 21

Our Ranking Methods

• BScatter

• MinMax

• bSum

• bMax

• bMin

• Combined

Page 22

Notation

• For each class i and each feature j, we define the mean value of feature j for class $C_i$:

$\mu_{j,i} = \frac{1}{|C_i|} \sum_{\mathbf{x} \in C_i} x_j$

• Define the total mean along feature j:

$\mu_j = \frac{1}{m} \sum_{\mathbf{x}} x_j$

Page 23

Notation

• Define between-class scatter along feature j

$B_j = \sum_{i=1}^{C} |C_i| (\mu_{j,i} - \mu_j)^2$

Page 24

Function 1: BScatter

• Fisher discriminant analysis for multiple classes under feature independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter

$\mathrm{BScatter}_j = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}}$

• where $\sigma_{j,i}$ is the standard deviation of class i along feature j
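BScatter can be sketched directly from the definitions of $B_j$ and $\sigma_{j,i}$ above; this is an illustration with hypothetical names, not the authors' code:

```python
import numpy as np

def bscatter(X, y):
    """BScatter_j = B_j / sum_i sigma_{j,i}, where
    B_j = sum_i |C_i| * (mu_{j,i} - mu_j)^2 is the between-class scatter.
    X: (samples, genes); y: class labels."""
    mu = X.mean(axis=0)                      # total mean per gene
    B = np.zeros(X.shape[1])
    sig = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        B += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        sig += Xc.std(axis=0)                # sum of per-class std devs
    return B / sig
```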

Page 25

Function 2: MinMax

• Favors features along which the farthest mean-class difference is large, and the within class variance is small.

$\mathrm{MinMax}_j = \frac{\mu_{j,\max} - \mu_{j,\min}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Page 26

Function 3: bSum

• For each feature j, we sort the C values $\mu_{j,i}$ in non-decreasing order: $\mu_{j,1} \le \mu_{j,2} \le \ldots \le \mu_{j,C}$
• Define $b_{j,l} = \mu_{j,l+1} - \mu_{j,l}$
• bSum rewards the features with large distances between adjacent mean class values:

$\mathrm{bSum}_j = \frac{\sum_{l=1}^{C-1} b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Page 27

Function 4: bMax

• Rewards features j with a large between-neighbor-class mean difference

$\mathrm{bMax}_j = \frac{\max_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Page 28

Function 5: bMin

• Favors the features whose smallest between-neighbor-class mean difference is large:

$\mathrm{bMin}_j = \frac{\min_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Page 29

Function 6: Comb

• Considers a score function which combines MinMax and bMin

$\mathrm{Comb}_j = \frac{(\min_l b_{j,l})(\mu_{j,\max} - \mu_{j,\min})}{\sum_{i=1}^{C} \sigma_{j,i}}$
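Since MinMax, bSum, bMax, bMin, and Comb all share the same denominator, they can be computed together; this sketch uses illustrative names and assumes the sorted-mean and $b_{j,l}$ definitions from the preceding slides:

```python
import numpy as np

def b_scores(X, y):
    """MinMax, bSum, bMax, bMin and Comb for every gene at once.
    b_{j,l} = mu_{j,l+1} - mu_{j,l}, over class means sorted per gene."""
    classes = np.unique(y)
    mus = np.stack([X[y == c].mean(axis=0) for c in classes])  # (C, genes)
    sig = sum(X[y == c].std(axis=0) for c in classes)          # sum_i sigma_{j,i}
    mus.sort(axis=0)                   # non-decreasing class means per gene
    b = np.diff(mus, axis=0)           # adjacent mean gaps b_{j,l}, (C-1, genes)
    spread = mus[-1] - mus[0]          # mu_{j,max} - mu_{j,min}
    return {"MinMax": spread / sig,
            "bSum": b.sum(axis=0) / sig,
            "bMax": b.max(axis=0) / sig,
            "bMin": b.min(axis=0) / sig,
            "Comb": b.min(axis=0) * spread / sig}
```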

Page 30

Datasets

Dataset  | Samples | Genes | Classes | Comment
MLL      | 72      | 12582 | 3       | Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma | 88      | 4026  | 6       | 46 DLBCL, 11 CLL, 9 FL (malignant classes), 11 ABB, 6 RAT, 6 TCL (normal samples); available at http://llmpp.nih.gov/lymphoma
Yeast    | 80      | 5775  | 3       |
NCI60    | 61      | 1155  | 8       | Available at http://rana.lbl.gov/

Page 31

Experiment Design

• Gene expression values scaled to [-1, 1]
• Nine feature selection methods compared (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE)
• Subsets of top-ranked genes used to train an SVM classifier (3 kernel functions: linear, 2-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2])
• Leave-one-out cross validation due to small sample size
• One-vs-one multi-class classification implemented with LIBSVM
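The evaluation loop can be sketched with scikit-learn, whose SVC class wraps LIBSVM and uses one-vs-one multi-class classification. This is an illustration, not the authors' pipeline: `rank_fn` is a placeholder for any of the ranking scores, and here genes are re-selected inside each fold rather than over the whole training set:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(X, y, rank_fn, k, **svc_kw):
    """Leave-one-out evaluation: on each training fold, keep the k genes
    ranked highest by rank_fn, train an SVC (one-vs-one for multi-class,
    LIBSVM underneath), and classify the held-out sample."""
    correct = 0
    for tr, te in LeaveOneOut().split(X):
        top = np.argsort(rank_fn(X[tr], y[tr]))[::-1][:k]  # top-k genes
        clf = SVC(**svc_kw).fit(X[tr][:, top], y[tr])
        correct += int(clf.predict(X[te][:, top])[0] == y[te][0])
    return correct / len(y)
```

Scaling to [-1, 1] would be applied before this loop; `svc_kw` can carry the kernel and soft-margin choices being searched.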

Page 32

Result – MLL Dataset

Page 33

Result – Lymphoma Dataset

Page 34

Conclusions

• SVM classification benefits from gene selection;
• Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions in most data sets; the best performing correlation score varies from problem to problem;
• Although SVM-RFE shows excellent performance in general, there is no clear winner; the performance of feature selection methods seems to be problem-dependent.

Page 35

Conclusions

• For a given classification model, different gene selection methods reach the best performance for different feature set sizes;

• Very high accuracy was achieved on all the data sets studied here. In many cases perfect accuracy (based on leave-one-out error) was achieved;

• The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).

Page 36

Limitations & Future Work

• The selection of features over the whole training set induces a bias in the results. We will study how to assess and correct this bias in future experiments.

• We will take the correlation between pairs of selected features into consideration; the ranking method will be modified so that correlations stay below a certain threshold.

• Evaluate top-ranked genes in our research against marker genes identified in other studies.