
A hybrid method for gene selection in microarray datasets

Yungho Leu, Chien-Pan Lee and Ai-Chen Chang

National Taiwan University of Science and Technology

2014/10/22

Outline

Microarray Datasets & Research Objective

Related work & Background

Research method

Experimental result

Conclusion

Microarray datasets

Microarray technology can be used to measure the expression levels of thousands of genes at the same time.

A microarray dataset records the gene expressions of different samples in a table.


Microarray datasets

N : number of samples (40~200)
M : number of genes (2,000~30,000) (M >> N)
gi,j : expression level of gene j at sample i
Class label : the class label of the sample

The Prostate cancer dataset (simplified), with N samples and M genes:

Sample   Gene1    Gene2    Class
S1        0.022   -0.721   0
S2       -1.034    0.331   0
...       ...      ...     ...
Sj-1     -0.212    0.123   1
Sj        0.542    0.431   1

Class label: 0 = Absent, 1 = Present

Research objective

M >> N poses challenges in diagnosis (or classification).

Objective: to select a minimal subset of genes with a high classification accuracy rate.

This is a gene selection problem.

Outline

Microarray Datasets & Research Objective

Related work & Background

Research method

Experimental result

Conclusion

Related work

Ding, C., & Peng, H. used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets.
"Minimum redundancy feature selection from microarray gene expression data." (2003 & 2005)

Yang, et al. proposed to use information gain and genetic algorithms for gene selection.
"IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data." (2010)

Related work

Luo, et al. clustered genes into groups and treated genes in the same group as redundant genes.
"Improving the Computational Efficiency of Recursive Cluster Elimination." (2011)

Background knowledge

Information Gain: proposed by Quinlan as the basis of attribute selection in decision trees.

Attributes with larger information gains are better for classification (i.e., for differentiating between the class labels of data samples).

Ecological correlation (Robinson)

Divide the dataset into groups and use the means of the different groups to calculate the Pearson correlation coefficients.

This reduces the in-group variance and increases the value of the correlation coefficient between attributes.

Example

Leukemia1 dataset grouped by class labels (0, 1, 2):

gene1     gene2     class
-0.9058   -0.9298   0
 0.8371   -1.3022   0
 1.0694   -0.7826   1
-1.5851   -0.8680   1
-0.1908   -0.6507   2
-1.0578    0.8268   2

Per-class means:

         μ0        μ1        μ2
gene1   -0.0344   -0.2578   -0.6243
gene2   -1.1160   -0.8253    0.0881

Cor(gene1{μ0, μ1, μ2}, gene2{μ0, μ1, μ2}) = -0.9886
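The ecological correlation above can be reproduced in a few lines of Python. This is a minimal sketch using pandas and NumPy on the simplified table from this slide; only the six sample rows shown above are assumed.

```python
import numpy as np
import pandas as pd

# The simplified Leukemia1 example from the slide.
df = pd.DataFrame({
    "gene1": [-0.9058, 0.8371, 1.0694, -1.5851, -0.1908, -1.0578],
    "gene2": [-0.9298, -1.3022, -0.7826, -0.8680, -0.6507, 0.8268],
    "class": [0, 0, 1, 1, 2, 2],
})

# Ecological correlation: correlate the per-class means
# instead of the raw per-sample values.
class_means = df.groupby("class")[["gene1", "gene2"]].mean()
eco_cor = np.corrcoef(class_means["gene1"], class_means["gene2"])[0, 1]
print(class_means)
print(f"ecological correlation = {eco_cor:.4f}")  # -0.9886
```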

Support Vector Machine

A classification method by Cortes & Vapnik (1995). It finds a good hyper-plane to separate samples with different class labels.

[Figure: two candidate hyper-planes a and b, with margin boundaries a1, a2 and b1, b2 and the support vectors marked. Since |a1 - a| > |b1 - b|, hyper-plane a has the larger margin and is better than hyper-plane b.]
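As a quick illustration of how an SVM classifier is used in this setting, the sketch below trains a linear SVM on a synthetic few-samples/many-features dataset as a stand-in for microarray data. The slide does not specify the kernel or evaluation protocol, so the linear kernel and 5-fold cross-validation here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a microarray dataset: few samples, many genes.
X, y = make_classification(n_samples=60, n_features=200,
                           n_informative=10, random_state=0)

clf = SVC(kernel="linear")  # linear SVM: maximum-margin hyper-plane
print(cross_val_score(clf, X, y, cv=5).mean())
```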

Outline

Microarray Datasets & Research Objective

Related work & Background

Research method

Experimental result

Conclusion

Research method


Data preprocessing

Step I : Gene filtering using IG

Step II : Redundant gene elimination using clustering

Step III : Subset refinement using genetic algorithm

Data preprocessing - Normalization

Normalize the dataset using the Z-score.

Z-score of gene expression Xij:

$Z_{ij} = \dfrac{X_{ij} - \bar{x}_j}{S_j}$

Where
- $X_{ij}$ : the expression of gene j on sample i.
- $\bar{x}_j$ : mean of gene j's expression over the different samples.
- $S_j$ : standard deviation of gene j's expression over the different samples.
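A minimal sketch of this normalization step, assuming the dataset is an N x M NumPy array with one row per sample and one column per gene:

```python
import numpy as np

def zscore_normalize(X):
    """Z-score normalize a microarray matrix X (N samples x M genes),
    gene by gene: Z[i, j] = (X[i, j] - mean_j) / std_j."""
    means = X.mean(axis=0)        # per-gene mean over samples
    stds = X.std(axis=0, ddof=1)  # per-gene standard deviation
    return (X - means) / stds
```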

Gene filtering by information gain


Gene filtering

Most of the genes have IG values equal to 0.

Select the genes with IG greater than 0 as candidate genes.

For example, the Leukemia1 dataset has 5,327 genes; only 263 genes are left after gene filtering with IG.

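A sketch of the IG filter, assuming expression values are discretized into equal-width bins before computing the gain (the slides do not state the discretization scheme, so the 10-bin choice here is an assumption):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a class-label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(expr, labels, n_bins=10):
    """IG of one gene's continuous expression vector with respect to the
    class labels, after equal-width discretization (an assumed scheme)."""
    edges = np.histogram_bin_edges(expr, bins=n_bins)
    bins = np.digitize(expr, edges)
    info_d = entropy(labels)
    info_a = sum((bins == b).mean() * entropy(labels[bins == b])
                 for b in np.unique(bins))
    return info_d - info_a

# Keep genes with IG > 0 as candidates, e.g.:
# candidates = [j for j in range(X.shape[1]) if information_gain(X[:, j], y) > 0]
```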


Grouping of genes

Gene list and correlation threshold:
Build the list of candidate genes and set the threshold to 0.8 (strongly positively correlated).

Grouping method: with the first gene on the list as the basis, group the rest of the genes with the basis gene if their correlation coefficients are greater than 0.8.


Build a gene list:

Gene ID
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
...

Calculate correlation coefficients:

Gene pair   Cor.
Gene 1,2    0.83
Gene 1,3    0.53
Gene 1,4    0.32
Gene 1,5    0.13
...         ...

Eliminate genes from the existing group

Eliminate the genes in the group from the list; repeat the same procedure on the rest of the genes until no genes are left on the list.

Cluster1: {Gene 1, Gene 2}   (Cor(Gene 1, Gene 2) = 0.83 > 0.8)

Remaining gene list:

Gene ID
Gene 3
Gene 4
Gene 5

Select one gene from each group

Select the gene with the highest IG from each group.
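A sketch of Step II under stated assumptions: the candidate list is ordered by descending IG (the slides do not say how the list is ordered), and plain Pearson correlation over samples stands in for the ecological correlation described earlier:

```python
import numpy as np

def group_genes(X, ig, threshold=0.8):
    """Group candidate genes by correlation with a basis gene and keep
    one representative per group (the gene with the highest IG).
    X: samples x candidate-genes matrix; ig: IG value of each gene."""
    remaining = list(np.argsort(ig)[::-1])  # assumed: ordered by IG
    representatives = []
    while remaining:
        basis = remaining.pop(0)
        group = [basis]
        for g in remaining[:]:
            cor = np.corrcoef(X[:, basis], X[:, g])[0, 1]
            if cor > threshold:             # strongly positively correlated
                group.append(g)
                remaining.remove(g)
        # Keep the gene with the highest IG in the group.
        representatives.append(max(group, key=lambda k: ig[k]))
    return representatives
```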

Eliminate genes with no classification capability

ANOVA:

For datasets with more than two class labels, use ANOVA to test whether the class means are all equal.

Hypothesis:
$H_0 : \mu_1 = \mu_2 = \mu_3$
$H_1 : \text{not all } \mu_i \text{ are equal}$

Genes whose means do not differ over the different class labels are eliminated.

T-test

For datasets with two class labels, the t-test is used to test whether the class means of a gene are different.

Genes with no difference between class means are eliminated.

The significance level α is set to 0.05.
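Both elimination tests can be sketched with SciPy: a two-sample t-test for 2-class datasets and one-way ANOVA otherwise, at α = 0.05 as stated above. The per-gene loop and the names X, y are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def keep_discriminative(X, y, alpha=0.05):
    """Drop genes whose class means do not differ significantly:
    t-test for 2-class data, one-way ANOVA for more classes."""
    classes = np.unique(y)
    kept = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in classes]
        if len(classes) == 2:
            _, p = stats.ttest_ind(groups[0], groups[1])
        else:
            _, p = stats.f_oneway(*groups)
        if p < alpha:  # reject H0: all class means are equal
            kept.append(j)
    return kept
```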

Subset refinement using GA


Subset refinement

Encoding: binary encoding
• "0" --- gene not selected; "1" --- gene is selected.
• Example: 011001 --- select the 2nd, 3rd, and 6th genes from the subset.

Chromosome length: the size of the candidate gene subset from Step II.

Population size = 5

Number of iterations = 1,000
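Decoding such a chromosome is a one-liner; a minimal sketch of the example above:

```python
import numpy as np

# The chromosome 011001 from the example above.
chromosome = np.array([0, 1, 1, 0, 0, 1])
selected = np.flatnonzero(chromosome)  # 0-based indices [1, 2, 5]
print(selected + 1)                    # 1-based: the 2nd, 3rd, 6th genes
```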

Subset refinement

Fitness function: the SVM accuracy rate of the chromosome.

Selection method: roulette wheel
• Selection probability is proportional to the fitness value of the chromosome.

Single-point crossover and mutation:
• Crossover rate = 0.7
• Mutation rate = 0.3

Termination condition

Termination condition (any of the following):
• Accuracy rate = 100%
• # of iterations = 1,000
• # of iterations is greater than 100 and the accuracy rates of the last 20 iterations are all the same.

Final solution: the chromosome with the largest fitness value in the last iteration.
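Putting Step III together, here is a compact sketch of the GA with the parameters from these slides (population 5, 1,000 iterations, crossover rate 0.7, mutation rate 0.3, roulette-wheel selection, and the termination rules above). The fitness uses 5-fold cross-validated SVM accuracy, and mutation flips one random bit per child; both details are assumptions, since the slides do not specify them.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(chrom, X, y):
    """SVM accuracy rate on the genes selected by the chromosome."""
    if not chrom.any():
        return 0.0
    return cross_val_score(SVC(), X[:, chrom.astype(bool)], y, cv=5).mean()

def roulette_select(pop, fits):
    """Roulette wheel: selection probability proportional to fitness."""
    total = fits.sum()
    p = fits / total if total > 0 else np.full(len(fits), 1 / len(fits))
    i, j = rng.choice(len(pop), size=2, p=p)
    return pop[i], pop[j]

def refine_subset(X, y, pop_size=5, iters=1000, pc=0.7, pm=0.3):
    n = X.shape[1]  # chromosome length = size of the candidate subset
    pop = rng.integers(0, 2, size=(pop_size, n))
    history = []
    for it in range(iters):
        fits = np.array([fitness(c, X, y) for c in pop])
        history.append(fits.max())
        # Termination: perfect accuracy, or >100 iterations with the
        # best accuracy unchanged over the last 20 iterations.
        if fits.max() == 1.0 or (it > 100 and len(set(history[-20:])) == 1):
            break
        children = []
        while len(children) < pop_size:
            a, b = roulette_select(pop, fits)
            child = a.copy()
            if rng.random() < pc:            # single-point crossover
                cut = rng.integers(1, n)
                child = np.concatenate([a[:cut], b[cut:]])
            if rng.random() < pm:            # mutation: flip one bit (assumed)
                child[rng.integers(n)] ^= 1
            children.append(child)
        pop = np.array(children)
    fits = np.array([fitness(c, X, y) for c in pop])
    return pop[fits.argmax()]  # chromosome with the largest fitness
```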

Outline

Microarray Datasets & Research Objective

Related work & Background

Research method

Experimental result

Conclusion

The datasets


Data set name # of samples # of class labels # of genes

9_Tumors 60 9 5,726

Brain_Tumor1 90 5 5,920

Brain_Tumor2 50 4 10,367

Leukemia1 72 3 5,327

Leukemia2 72 3 11,225

Lung Cancer 203 5 12,600

SRBCT 83 4 2,308

11_Tumors 174 11 12,533

Prostate Tumor 102 2 10,509

DLBCL 77 2 5,469

GEMS : http://www.gems-system.org/

Genes selected in 3 steps


Data Set          # of original genes   IG      Grouping   GA

9_Tumors 5,726 103 25 13

Brain_Tumor1 5,920 185 19 10

Brain_Tumor2 10,367 3,099 19 4

Leukemia1 5,327 263 7 4

Leukemia2 11,225 3,097 6 3

Lung_Cancer 12,600 3,183 36 18

SRBCT 2,308 351 14 7

11_Tumors 12,533 3,483 510 255

Prostate_Tumor 10,509 671 235 119

DLBCL 5,469 315 169 84

Comparison with other methods

Comparison of our method (Hybrid) with GEPUBLIC, PAM, and IG-GA: classification accuracy rate (%), with the number of selected genes in parentheses.

Data Set GEPUBLIC PAM IG-GA Hybrid

9_Tumors 66.67(19) 43.33 (47) 85.00 (52) 71.67(13)

Brain_Tumor1 84.44(30) 85.56 (42) 93.33 (244) 91.12(10)

Brain_Tumor2 80.00(15) 66.00 (25) 88.00 (489) 92.00(4)

Leukemia1 97.22(11) 93.06 (11) 100.00 (82) 97.23(4)

Leukemia2 91.67(31) 91.67 (52) 98.61 (782) 100.00(3)

Lung_Cancer 94.58(29) 93.60 (75) 95.57 (2101) 97.05(18)

SRBCT 98.80(26) 98.80 (41) 100.00 (56) 100.00(7)

11_Tumors 86.21(87) 81.61 (203) 92.53 (479) 91.95(255)

Prostate_Tumor 95.10(4) 93.14 (13) 96.08 (343) 94.12(119)

DLBCL 97.40(13) 80.52 (70) 100.00 (107) 97.40(84)

Outline

Microarray Datasets & Research Objective

Related work & Background

Research method

Experimental result

Conclusion

Conclusion

Each step in our method effectively reduces the noisy genes passed on from the previous step.

The hybrid method selects fewer genes with a higher classification accuracy rate.

The hybrid method needs further improvement on 2-class microarray datasets.

Q & A

Thank you for listening.

Information Gain

For a dataset D with m different class labels, Info(D) measures how evenly the classes of D are distributed:

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

where $p_i$ is the probability that a sample in D belongs to class i.

$Info_A(D)$ : the equivalent Info (weighted sum) of the subsets of D, where D is split into subsets using attribute A:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$

where A has v different values {a1, a2, …, av}, D is split into {D1, D2, …, Dv}, and Dj contains the samples with A equal to aj.

Gain(A):

$Gain(A) = Info(D) - Info_A(D)$

(Source: Data Mining: Concepts and Techniques)

Attribute Selection: Information Gain

• Class P: buys_computer = "yes" : 9 samples
• Class N: buys_computer = "no" : 5 samples

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

The training data:

Age      income   student   credit      Buy
<=30     high     no        fair        no
<=30     high     no        excellent   no
31…40    high     no        fair        yes
>40      medium   no        fair        yes
>40      low      yes       fair        yes
>40      low      yes       excellent   no
31…40    low      yes       excellent   yes
<=30     medium   no        fair        no
<=30     low      yes       fair        yes
>40      medium   yes       fair        yes
<=30     medium   yes       excellent   yes
31…40    medium   no        excellent   yes
31…40    high     yes       fair        yes
>40      medium   no        excellent   no

Counts by age:

age      P   N
<=30     2   3
31…40    4   0
>40      3   2