
Fadhl M. Al-Akwaa Biomedical Eng. Dept., Univ. of Science & Technology, Sana’a, Yemen



BicAT_Plus: An Automatic Bi/Clustering Comparative Tool of Gene Expression Data Obtained Using Microarrays

Fadhl M. Al-Akwaa
Biomedical Eng. Dept., Univ. of Science & Technology, Sana’a, Yemen
Biomedical Engineering Department, Cairo University, Giza, Egypt

Mohamed H. Ali
Computer Science School, Nottingham University, Nottingham, United Kingdom

Yasser M. Kadah
Center for Informatics Sciences, Nile University, Egypt
Biomedical Engineering Department, Cairo University, Giza, Egypt


What is Bioinformatics?

Bioinformatics is defined as the creation and advancement of databases, algorithms, computational and statistical techniques, and theory for understanding biological processes.


The Central Dogma

[Figure: DNA (nucleus) is transcribed into RNA (cytoplasm), which is translated into PROTEIN (cytoplasm). Label: Gene Expression Level.]


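Biological Balance Feedback System

[Figure: feedback diagram for GENE A linking external or internal stimuli, transcription rate, gene expression level, translation rate, and protein level through positive (+) and negative (-) connections. Labels: Gene on, Gene off, Disease, Drug.]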


Transcriptome data: Microarray Technology

Gene Expression Data

        C1      …       Cm
G1      0.5     1       2
…
Gn      3       2       1


Biological Balance Feedback System

[Figure: feedback diagrams for GENE A and GENE B. For each gene, the transcription rate, gene expression level, translation rate, and protein level are linked by positive (+) and negative (-) connections, driven by external or internal stimuli.]


Biological Balance Feedback System

[Figure: Gene A and Gene B form a balance feedback loop system. Genes g1 to g4, connected by positive (+) and negative (_) edges, form a Gene Regulatory Network (GRN).]


Gene Regulatory Network GRN


Biological Databases

[Figure: the central dogma diagram (DNA, transcription, RNA, translation, PROTEIN).]


Drug Discovery

• One of the main objectives of bioinformatics is to integrate these databases to advance human health.

[Figure labels: Drug Discovery, Disease Ontology]


Drug Discovery & GRN

[Figure: gene regulatory network of g1 to g4 connected by positive (+) and negative (_) edges.]

• The cost of bringing a new drug to market ranges from around 500 million to 2,000 million dollars.
• Drug design requires a sophisticated understanding of how genes interact with each other, that is, constructing the GRN.


Drug Discovery: GRN steps

• Experimental Design: experimental condition, sampling rate, error
• Prepare microarray chip
• Data Extraction: microarray image segmentation
• Preprocessing: normalization, discretization, filtration, missing values, low entropy, low variance
• Gene Expression Matrix (genes g1 … gn by conditions c1 … cm)
• Gene Clustering: traditional clustering methods, bicluster methods
• Network Generation: dynamic Bayesian network, probabilistic Boolean network, fuzzy network, …
• Drug Testing


Gene Expression Data Analysis: Clustering

[Figure: an expression matrix of n genes by m assays is converted into an n × n gene similarity matrix, and genes are clustered based on similarity.]

Similarity measures: Euclidean distance, Pearson correlation coefficient
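As a minimal sketch of this step, the Python below builds the n × n similarity matrix from an n × m expression matrix with the Pearson correlation coefficient. The function names and the middle row of the toy matrix are illustrative assumptions; they are not taken from BicAT-plus.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance; note this is a distance, so smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity_matrix(expr, measure=pearson):
    """Return the n x n matrix of pairwise measures between gene rows."""
    n = len(expr)
    return [[measure(expr[i], expr[j]) for j in range(n)] for i in range(n)]

# Toy expression matrix: rows are genes, columns are assays.
expr = [
    [0.5, 1.0, 2.0],   # G1 from the earlier example
    [1.0, 2.0, 4.0],   # hypothetical gene, exactly twice G1
    [3.0, 2.0, 1.0],   # Gn from the earlier example
]
sim = similarity_matrix(expr)
```

Passing `measure=euclidean` gives a distance matrix instead, in which case the clustering step would look for the smallest value rather than the largest.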


Hierarchical Clustering

        g1      g2      g3      g4      g5
g1              0.23    0.00    0.95   -0.63
g2                      0.91    0.56    0.56
g3                              0.32    0.77
g4                                     -0.36
g5

[Dendrogram: g1 and g4 joined.]

• Find the largest value in the similarity matrix.
• Join genes together.
• Recompute the matrix and iterate.


Hierarchical Clustering

        g1,g4   g2      g3      g5
g1,g4           0.37    0.16   -0.52
g2                      0.91    0.56
g3                              0.77
g5

[Dendrogram: g1 and g4 joined; g2 and g3 joined.]

• Find the largest value in the similarity matrix.
• Join clusters together.
• Recompute the matrix and iterate.


Hierarchical Clustering

        g1,g4   g2,g3   g5
g1,g4           0.27   -0.52
g2,g3                   0.68
g5

[Dendrogram: g1 and g4 joined; g2 and g3 joined; g5 joined to the g2, g3 cluster.]

• Find the largest value in the similarity matrix.
• Join clusters together.
• Recompute the similarity matrix and iterate.
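The three steps can be sketched in Python using the five-gene similarity matrix from the first hierarchical clustering slide. This is a sketch under one assumption: the recompute step uses average linkage on the pairwise similarities, which is a simplification. The slides do not state the linkage rule, so the recomputed values need not match the slide matrices exactly, although the merge order does.

```python
import itertools

# Upper-triangular similarities from the five-gene example.
SIM = {
    frozenset(("g1", "g2")): 0.23, frozenset(("g1", "g3")): 0.00,
    frozenset(("g1", "g4")): 0.95, frozenset(("g1", "g5")): -0.63,
    frozenset(("g2", "g3")): 0.91, frozenset(("g2", "g4")): 0.56,
    frozenset(("g2", "g5")): 0.56, frozenset(("g3", "g4")): 0.32,
    frozenset(("g3", "g5")): 0.77, frozenset(("g4", "g5")): -0.36,
}

def agglomerate(names, sim):
    """Greedy agglomerative clustering on a pairwise similarity lookup.

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    clusters = [(name,) for name in names]
    merges = []

    def linkage(ca, cb):
        # Average linkage (an assumption): mean similarity over member pairs.
        values = [sim[frozenset((a, b))] for a in ca for b in cb]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # Step 1: find the largest value in the current similarity matrix.
        ca, cb = max(itertools.combinations(clusters, 2),
                     key=lambda pair: linkage(*pair))
        merges.append((ca, cb, linkage(ca, cb)))
        # Steps 2 and 3: join the pair; similarities are recomputed on demand.
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
    return merges

merges = agglomerate(["g1", "g2", "g3", "g4", "g5"], SIM)
merged_sets = [set(a) | set(b) for a, b, _ in merges]
```

The list `merges` is the information a dendrogram draws: which clusters were joined, in what order, and at what similarity.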


Hierarchical Clustering: Dendrogram

Eisen et al. (1998), PNAS, 95(25): 14863-14868


Gene Expression Data Analysis: Clustering

• A cluster is a group of genes that show similar expression profiles across all the experiments.

• Examples
  – K-means
  – Hierarchical
  – Self-Organizing Map
  – CLICK
  – Model-based clustering

Eisen et al. (1998), PNAS, 95(25): 14863-14868


Gene Expression Data Analysis: Clustering Limitations

        c1    c2    c3    c4    c5    c7    c8    c9    c10
g1      3     4     1     1     7     10    11    1     1
g2      5     6     1     1     0.5   0.1   1     1     1
g3      2     2     2     2     2     2     2     2     2
g4      1     1     1     1     2     2     2     1     1
g5      3     4     4     2     5     4     7     9     8
g6      6     7     1     9     0     6     4     2     1
g7      0.5   0.1   1     2     2     2     2     2     5
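One way to read this table (an interpretation, not something stated on the slide) is that g1 and g2 move together in conditions c1 to c4 but not elsewhere, so a similarity computed over every condition hides the local pattern. A small Python sketch of that reading, with a hand-written Pearson helper:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rows g1 and g2 of the table, in the column order shown.
g1 = [3, 4, 1, 1, 7, 10, 11, 1, 1]
g2 = [5, 6, 1, 1, 0.5, 0.1, 1, 1, 1]

r_all = pearson(g1, g2)          # correlation across every condition
r_sub = pearson(g1[:4], g2[:4])  # correlation over c1..c4 only
```

Here `r_sub` is strongly positive while `r_all` is negative, which is the situation biclustering targets: genes grouped together with the subset of conditions under which they agree.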


Gene Expression Data Analysis: biClustering

George M. Church, Professor of Genetics, Harvard Medical School

The mean squared residue score (MSRS)
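For reference, the mean squared residue of Cheng and Church (2000) for a submatrix with rows I and columns J is the average of (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ, a_Ij, and a_IJ are the row, column, and overall means of the submatrix. The Python below is a minimal sketch of the score only, not of the full CC search procedure; the function name is an illustrative choice.

```python
def mean_squared_residue(sub):
    """Mean squared residue of a submatrix given as a list of equal-length rows."""
    rows, cols = len(sub), len(sub[0])
    row_mean = [sum(r) / cols for r in sub]
    col_mean = [sum(sub[i][j] for i in range(rows)) / rows for j in range(cols)]
    all_mean = sum(row_mean) / rows
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            # Residue: what remains after removing row, column, and overall effects.
            residue = sub[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (rows * cols)

# A perfectly additive pattern (each row is a shifted copy) scores 0.
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
# A checkerboard pattern has no additive structure and scores higher.
checker = [[1.0, 0.0], [0.0, 1.0]]
```

A low score means the genes rise and fall together across the selected conditions; in the CC algorithm a submatrix is accepted as a δ-bicluster when its score is at most δ.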


Biclustering Algorithms

Algorithm               Author
BiVisu / pClusters      Kin-On Cheng et al., 2008; Haixun Wang, 2002
RMSBE                   Xiaowen Liu and Lusheng Wang, 2006
Bimax                   Prelic et al., 2006
ROBA                    Alain B. Tchagang and Ahmed H. Tewfik, 2005
x-motif                 Murali and Kasif, 2003
SAMBA                   Tanay et al., 2002
OPSM                    Ben-Dor et al., 2002
Plaid                   Laura Lazzeroni and Art Owen, 2000
ISA                     Ihmels et al., 2002
CC / δ-biclusters       Cheng and Church, 2000


Paper Idea

Which algorithm is suitable for my dataset?

Which algorithm is better? And do some algorithms have advantages over others?

Generally, comparing different biclustering algorithms is not straightforward as they differ in strategy, approach, computational complexity, number of parameters, and prediction ability.

Moreover, such methods are strongly influenced by user-selected parameter values.


BicAT-plus

• To the best of our knowledge, no bicluster comparison toolbox has been available in the literature.

• We have developed a comparative tool, which we call “BicAT-plus”, that includes a biological comparative methodology. It enables researchers and biologists to compare the different bi/clustering methods based on a set of biological values and to draw conclusions about the biological meaning of the results.


BicAT

BicAT-plus is an extension of the BicAT toolbox, a popular gene expression analysis toolbox that contains 5 biclustering algorithms and 2 traditional clustering algorithms.

• OPSM
• CC
• ISA
• x-motif
• Bimax
• K-means
• Hierarchical


BicAT-plus Comparison Methodology

[Figure: the biclusters produced by each compared algorithm (one set of n biclusters, one set of m biclusters), each a list of genes such as g1, g2, g3, g4, g5, …, are labeled as enriched or not enriched. An enriched bicluster is one that has biological meaning.]

Sources of biological evidence:
• Function: GO
• Pathway: KEGG
• PPI: BioGRID
• Promoter: GenBank


BicAT-plus Comparison Methodology

• Percentage of enriched bi/clusters

$$\text{Percentage of enriched biclusters (significance level)} = \frac{\text{number of enriched biclusters at this level}}{\text{total number of biclusters}}$$

• Percentage of annotated genes per bi/cluster

$$\text{Study fraction of a GO term} = \frac{\text{no. of genes sharing the GO term in a bicluster}}{\text{total number of genes in this bicluster}} \times 100$$

• The predictive power of an algorithm to recover a pattern of interest selected by the user.
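The first two measures can be sketched in a few lines of Python. Two things here are assumptions rather than statements from the slides: a bicluster counts as enriched when its best GO-term p-value is at or below the chosen significance level, and the first ratio is multiplied by 100 so that it reads as a percentage. The function names are hypothetical.

```python
def percent_enriched(best_pvalues, level):
    """Percentage of biclusters whose best enrichment p-value is <= level.

    best_pvalues: one p-value per bicluster (assumed: the smallest p-value
    over all GO terms tested for that bicluster).
    """
    enriched = sum(1 for p in best_pvalues if p <= level)
    return 100.0 * enriched / len(best_pvalues)

def study_fraction(bicluster_genes, term_genes):
    """Percentage of the bicluster's genes that are annotated with the GO term."""
    genes = set(bicluster_genes)
    shared = genes & set(term_genes)
    return 100.0 * len(shared) / len(genes)

# Hypothetical inputs for illustration only.
pvals = [0.001, 0.2, 0.03, 0.5]
bicluster = ["g%d" % i for i in range(1, 11)]   # g1 .. g10
go_term = ["g%d" % i for i in range(1, 7)]      # g1 .. g6
```

With these inputs, half of the biclusters are enriched at the 0.05 level, and 6 of the 10 bicluster genes share the term.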


BicAT-plus Features

1. Adding more algorithms to the BicAT-plus tool in order to have one software package that employs most of the commonly used biclustering algorithms.


BicAT-plus Features

2. Perform functional analysis (Gene Ontology) of bicluster genes using the different GO categories:
   • Biological Process
   • Molecular Function
   • Cellular Component


BicAT-plus Features

3. Displaying the analysis and comparison results using graphical and statistical chart visualizations in multiple modes (2D and 3D).


BicAT-plus Features

4. Comparing the different biclustering algorithms based on different comparison methodologies.


Results

Bi/clustering Algorithm     Parameter settings
ISA                         tg = 2.0, tc = 2.0, seeds = 500
CC                          δ = 0.5, α = 1.2, M = 100
OPSM                        l = 100
BiVisu                      Ε = 60, Nr = 10, Nc = 5, = 25
K-means                     K = 100

We used the Gasch gene expression data: http://genome-www.stanford.edu/yeast_stress/

We used the default parameters recommended by the authors in their publications.


Percentage of enriched bi/clusters


Percentage of annotated genes per each bi/cluster


The predictive power of an algorithm to recover a pattern of interest

• The conditions applied in the Gasch experiments included temperature shocks, hydrogen peroxide, the superoxide-generating drug menadione, the sulfhydryl-oxidizing agent diamide, the disulfide-reducing agent dithiothreitol, …

• The user can compare biclustering algorithms by which of them recover a defined pattern, for example, which of them recover biclusters that respond to the conditions applied in the Gasch experiments.


GO Term / (number of annotated genes)                       K-means  CC   ISA  Bivisu  OPSM
GO:0006970 response to osmotic stress / (83)                3        5    6    3       0
GO:0006979 response to oxidative stress / (79)              2        7    11   0       0
GO:0046686 response to cadmium ion / (102)                  2        3    2    2       0
GO:0043330 response to exogenous dsRNA / (7)                2        3    2    2       0
GO:0046685 response to arsenic / (77)                       2        0    2    2       0
GO:0009408 response to heat / (24)                          3        0    2    2       0
GO:0009409 response to cold / (7)                           0        0    2    0       0
GO:0009267 cellular response to starvation / (44)           0        2    0    0       0
GO:0006995 cellular response to nitrogen starvation / (5)   4        4    4    0       0
GO:0042149 cellular response to glucose starvation / (5)    0        2    0    0       0
GO:0009651 response to salt stress / (15)                   2        7    0    0       0
GO:0042542 response to hydrogen peroxide / (5)              0        0    0    2       0
GO:0000304 response to singlet oxygen / (4)                 2        0    0    0       0
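A short Python tally makes the table easier to read. It only counts, for each algorithm, how many of the listed GO terms have a nonzero entry, and lists the terms with a nonzero entry for exactly one algorithm. The slide does not state what the cell values measure, so the sketch deliberately uses only zero versus nonzero; the helper names are illustrative.

```python
ALGORITHMS = ["K-means", "CC", "ISA", "Bivisu", "OPSM"]

# Rows copied from the table: GO term -> entries in ALGORITHMS order.
TABLE = {
    "GO:0006970": [3, 5, 6, 3, 0],
    "GO:0006979": [2, 7, 11, 0, 0],
    "GO:0046686": [2, 3, 2, 2, 0],
    "GO:0043330": [2, 3, 2, 2, 0],
    "GO:0046685": [2, 0, 2, 2, 0],
    "GO:0009408": [3, 0, 2, 2, 0],
    "GO:0009409": [0, 0, 2, 0, 0],
    "GO:0009267": [0, 2, 0, 0, 0],
    "GO:0006995": [4, 4, 4, 0, 0],
    "GO:0042149": [0, 2, 0, 0, 0],
    "GO:0009651": [2, 7, 0, 0, 0],
    "GO:0042542": [0, 0, 0, 2, 0],
    "GO:0000304": [2, 0, 0, 0, 0],
}

def terms_recovered(table):
    """Number of GO terms with a nonzero entry, per algorithm."""
    return {
        name: sum(1 for row in table.values() if row[i] != 0)
        for i, name in enumerate(ALGORITHMS)
    }

def exclusive_terms(table):
    """GO terms with a nonzero entry for exactly one algorithm."""
    result = {}
    for term, row in table.items():
        hits = [ALGORITHMS[i] for i, v in enumerate(row) if v != 0]
        if len(hits) == 1:
            result[term] = hits[0]
    return result

recovered = terms_recovered(TABLE)
exclusive = exclusive_terms(TABLE)
```

No column is nonzero for all 13 terms, and several terms appear for a single algorithm only, which is the observation the conclusion slides build on.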


BicAT-plus

This figure is for people who want to extend BicAT-plus by adding new features (or fixing bugs).


Conclusion

• The comparison methodology used in this study confirms that bicluster and cluster algorithms can be treated as integrated modules: no single algorithm can recover all the interesting patterns. What Algorithm A succeeds in recovering in certain data sets, Algorithm B might fail to recover, and vice versa.


Conclusion

• Using BicAT-plus, we can identify the highly enriched bi/clusters across all the compared algorithms and integrate them to address the dimensionality reduction problem in constructing gene regulatory networks from gene expression data, where the number of samples is smaller than the number of genes in the microarray dataset.


Thanks


BicAT-Plus

http://home.k-space.org/FADL/Downloads/BicAT_plus.zip


Availability and Requirements

• Availability: free to download from

• System requirements
  1. Java Runtime Environment (JRE); version 6 is recommended.
  2. ActivePerl version 5.10

Note: BicAT-plus has been tested on a PC with the following configuration: CPU: Pentium 4, 1.5 GHz; RAM: 2.0 GB; Platform: Windows XP Professional with SP2.


Algorithms comparison

• Generally, comparing different biclustering algorithms is not straightforward as they differ in strategy, approach, computational complexity, number of parameters, and prediction ability.

• Moreover, such methods are strongly influenced by user-selected parameter values. As a result, the quality of biclustering results is often considered more important than the required computation time.


Algorithms comparison

• Although there are some analytical comparative studies to evaluate the traditional clustering algorithms (Azuaje, 2002; Datta and Datta, 2003; Yeung, et al.), no such comprehensive comparison of biclustering methods can be found in the literature so far (Prelic, et al., 2006).


Cluster/bi-cluster algorithm performance comparison: Cluster Evaluation

[Figure: Cluster 1, Cluster 2, …, Cluster n, each a set of genes (g1, g2, g3, g4, g5, …).]

• Homogeneity between cluster genes

• Separation between clusters

“it is not clear how to extend notions such as homogeneity and separation (Gat-Viks et al., 2003) to the biclustering context (to our best knowledge, no general internal indices have been suggested so far for biclustering)” (Prelic et al., 2006)
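To make the two notions concrete for ordinary clusters, here is a minimal Python sketch. The definitions are one common formulation assumed for illustration, not taken from the slides: homogeneity is the average Pearson correlation between each gene and the mean profile (centroid) of its own cluster, and separation is the unweighted average Pearson correlation between the centroids of different clusters.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def centroid(profiles):
    """Column-wise mean profile of a cluster."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def homogeneity(clusters):
    """Average similarity of each gene to its own cluster centroid (higher is better)."""
    sims = [pearson(gene, centroid(c)) for c in clusters for gene in c]
    return sum(sims) / len(sims)

def separation(clusters):
    """Average similarity between centroids of different clusters (lower is better)."""
    cents = [centroid(c) for c in clusters]
    sims = [pearson(cents[i], cents[j])
            for i in range(len(cents)) for j in range(i + 1, len(cents))]
    return sum(sims) / len(sims)

# Hypothetical toy data: one rising cluster and one falling cluster.
clusters = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    [[3.0, 2.0, 1.0], [6.0, 4.0, 2.0]],
]
```

Both indices assume every gene is compared over the same set of conditions. A bicluster carries its own condition subset, which is why these indices do not transfer directly to the biclustering context.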


Cluster/bi-cluster algorithm performance comparison: Bicluster Evaluation

[Figure: bicluster 1, bicluster 2, …, bicluster n, each a set of genes (g1, g2, g3, g4, g5, …).]

• Function: hypergeometric test with the Gene Ontology database
• Pathway: KEGG
• PPI: BioGRID database
• Promoter: motif scanning program


Hypergeometric Test

[Figure: a test set of X genes (Cluster1: g1, g2, …, gX) drawn from a reference set of N genes (g1, g2, …, gN).]

“When sampling X genes (test set) out of N genes (reference set), what is the probability that x or more of these genes belong to a functional category C shared by n of the N genes in the reference set?”

Steven Maere et al. (Maere et al., 2005)
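The question above corresponds to the upper tail of the hypergeometric distribution: the sum over k from x to min(n, X) of C(n, k) · C(N − n, X − k) / C(N, X). A minimal Python sketch using only the standard library follows; the function name and the toy numbers are illustrative and not taken from the slides.

```python
from math import comb

def hypergeom_pvalue(N, n, X, x):
    """P(at least x of the X sampled genes belong to category C).

    N: genes in the reference set
    n: reference genes annotated with category C
    X: genes in the test set (the bi/cluster)
    x: test-set genes annotated with category C
    """
    upper = min(n, X)
    # comb returns 0 when X - k exceeds N - n, so impossible terms drop out.
    favourable = sum(comb(n, k) * comb(N - n, X - k) for k in range(x, upper + 1))
    return favourable / comb(N, X)

# Toy case: 4 reference genes, 2 annotated, 2 sampled.
p_at_least_one = hypergeom_pvalue(4, 2, 2, 1)
p_both = hypergeom_pvalue(4, 2, 2, 2)
```

A small p-value means the overlap between the bi/cluster and the category is unlikely to arise by chance, which is the sense in which a bi/cluster is called enriched.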


The Gene Ontology

• The Gene Ontology (GO) is a project that groups annotated genes (genes with known function) by function.

• Example in S. cerevisiae
  – Function name = cellular response to glucose starvation
  – Function ID = GO:0042149
  – Genes: g1, g2, g3, g4, g5, g6


Hypergeometric Test: Example

[Figure: Cluster1 (10 genes: g1, g2, …, g10) compared with GO:0042149, cellular response to glucose starvation (6 genes: g1, g2, …, g6). Figure label: 2,3,4,5,6.]


GO Enrichment Programs with Hypergeometric Test

• FuncAssociate
• GeneMerge
• GoMiner
• FatiGO
• GOstat
• GO::TermFinder
• http://www.geneontology.org/GO.tools.shtml
• We used the GeneMerge program, which was developed at the University of Maryland (C. I. Castillo-Davis, 2003).


GO Analysis programs Limitations

Reference set

Test set


BicAT

Swiss Federal Institute of Technology Zurich, ETH Zentrum, 8092 Zurich, Switzerland

• OPSM
• CC
• ISA
• x-motif
• Bimax
• K-means
• Hierarchical


BicAT-plus

• To the best of our knowledge, no such automatic gene ontology comparison tool has been available in the literature.

• We have developed a comparative tool, which we call “BicAT-plus”, that includes the biological comparative methodology and serves as an extension to the BicAT program.


BicAT-plus

• Moreover, BicAT-plus helps researchers compare and evaluate the algorithms' results multiple times, according to the user-selected parameter values and the required biological perspective, on various datasets.


Gene Expression Data Analysis: biClustering

• Recent understanding of cellular processes leads us to expect subsets of genes to be coregulated and coexpressed under certain experimental conditions, but to behave almost independently under other conditions. (A. Prelic, 2006, Bioinformatics)

• A bicluster is a group of genes that show similar expression profiles under certain conditions.
