
Predicting Gene Functions

Shubhra Sankar Ray [email protected]

Center for Soft Computing Research,

Indian Statistical Institute, Kolkata, India

Tasks in Bioinformatics:

Alignment, comparison and analysis of DNA, RNA, and protein sequences

Interpretation and analysis of microarray gene expression data

Gene mapping on chromosomes

Gene finding and promoter identification from DNA sequences

Predicting gene regulatory network

Construction of phylogenetic trees for studying evolutionary relationship

Structure prediction and classification of DNA, RNA and protein

Molecular and ligand design with molecular docking

Predicting function of unclassified genes

Need for Pattern Recognition and Structure Prediction Algorithms to Understand the Meaning of Data

Motivation

Tasks involved in gene function prediction

Basics of Data Sources

Related Problems

Combining Multi-Source information

Validation

Gene Function Prediction

Summary

References

Motivation

One of the important goals of biological investigation is to predict the function of unclassified genes.

One approach is to identify the nearest classified genes using different data sources, such as microarray gene expression, protein sequences, protein-protein interaction data, and pathway information from the Kyoto Encyclopedia of Genes and Genomes (KEGG).

Even in a model organism like yeast, more than 800 genes have unknown biological function in the Munich Information Center for Protein Sequences (MIPS) and the Saccharomyces Genome Database (SGD).

A single data set can assess functional relationships between genes and can assign accurate functional annotation to a significant number of unclassified genes, but a single data set alone often lacks the degree of specificity needed for accurate gene function prediction.

This improvement in specificity can be achieved through the combination of multiple data sets in an integrated analysis.

Tasks Involving Multi-Source Information Integration for Gene Function Prediction

Choosing informative data sources

Extracting similarities/scores from individual sources

Benchmarking data-sources in common framework

Finding a method for integrating different similarities/scores

Finding highly similar gene-pairs

Predicting networks/clusters of highly similar genes

Predicting function of unclassified genes

Data Sources

Different types of data source that can be used for gene function prediction are:

Microarray Gene Expression

Protein similarity through transitive homology

Protein-protein interaction information

KEGG Pathway information

According to MIPS the accuracies of all these data sources are above 50%.

Phylogenetic profiles, Rosetta Stone linkages, and Medline abstracts are avoided because of their low accuracies.

Measuring Gene Expression with Microarray

In the biological experiment, mRNA from the experimental samples is colored with the red-fluorescent dye Cy5 during reverse transcription.

The Cy5/Cy3 fluorescence ratio (the gene expression value) is obtained from the microarray by measuring the spot intensities with a fluorescence scanner.

Many unanswered, and important, questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting microarray data.

Microarray gene expression values

gene     Cell Cycle 1  Cell Cycle 2  Sporulation 1  Sporulation 2  Shock 1  Shock 2  Diauxic Shift 1
YDR029W  0.05          -0.21         0.16           0.19           0        2.19     -2.0
YBL052C  -0.38         -0.71         0.14           0.33           0.01     0.09     1.7
YOR337W  -0.42         1.78          0              0.62           0.64     0.45     2.92
YMR183C  -0.86         -0.67         -3.23          0.09           -0.09    0.48     0.49
YKR021W  -0.85         1.31          -0.46          3.42           -2.38    0.19     0.64
YHR023W  -1.77         -0.86         -1.07          0.1            0.28     0.91     0.97
YHR029C  -0.58         -0.91         -0.62          -0.13          0.06     -0.08    2.03

Dynamic range of PMT: -1.2 to 1.2, -3.0 to 3.0, -1.5 to 1.5, -2.0 to 2.0

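Similarity Extraction through Gene Expression

Gene similarity is extracted using the Pearson correlation, defined as

$$P_{X,Y} = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{x_i - \bar{X}}{\sigma_x} \right) \left( \frac{y_i - \bar{Y}}{\sigma_y} \right)$$

where $x_i$ and $y_i$ are the expression values of genes X and Y at the $i$th of $k$ time points, respectively, and $\bar{X}$ and $\sigma_x$ are the mean and standard deviation of the expression profile of gene X ($\bar{Y}$ and $\sigma_y$ are defined likewise for gene Y).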
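As a minimal sketch of the Pearson similarity described above (mean-centre each profile, divide by its standard deviation, then average the products over the k time points), the following pure-Python function may help. The function name `pearson_similarity` is an illustrative choice, and the use of the population standard deviation (dividing by k) is an assumption made so that the result matches the 1/k normalisation. The two profiles are rows from the expression table in these slides.

```python
from math import sqrt


def pearson_similarity(x, y):
    """Average product of z-scores of two equal-length expression profiles."""
    k = len(x)
    if k == 0 or k != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    # Population standard deviation (divide by k), assumed to match the 1/k factor.
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / k)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / k)
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / k


# Expression profiles of two genes, copied from the slide's table.
ydr029w = [0.05, -0.21, 0.16, 0.19, 0.0, 2.19, -2.0]
ybl052c = [-0.38, -0.71, 0.14, 0.33, 0.01, 0.09, 1.7]
p_xy = pearson_similarity(ydr029w, ybl052c)
```

A profile compared with itself gives 1, and a profile compared with its negation gives -1, which is a quick sanity check on any implementation.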

Protein Sequence Similarity extraction through Transitive Homologues

6221 yeast protein sequences, corresponding to 6221 ORFs/genes, are downloaded from SGD and compared, using BLAST, with 3,357,450 protein sequences downloaded from the UniProt database

Proteins related by direct homology are all classified

Transitive homology can identify new relations through the 3,357,450 sequences

The BLAST score ($B_{X,Y}$) is used as the similarity between two protein sequences X and Y.

How BLAST (Basic Local Alignment Search Tool) Works

Given a query sequence, look for high-scoring words of length W in the database

Compile a list L of all words that score > T

When a word is found, extend the alignment

When the score drops by more than X, stop the extension

Report all segment pairs with a large score S

W: Word size – minimum number of aligned amino acids (3)

T: Threshold – focus on pairs scoring > T (11)

X: Drop-off – stop extending when loss > X (15)

S: Score – the final score of the segment pair

There are twenty types of amino acids; each pair of amino acids has a similarity score, which varies from pair to pair
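The extension step with the drop-off parameter X can be sketched as follows. This is a simplified, ungapped, right-only extension under stated assumptions, not the actual BLAST implementation: the function names and the toy scoring (+5 for a match, -4 for a mismatch) are hypothetical, whereas real BLAST scores pairs with a substitution matrix.

```python
def toy_score(a, b):
    # Hypothetical scoring for illustration only: +5 match, -4 mismatch.
    return 5 if a == b else -4


def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 score=toy_score):
    """Extend a word hit to the right without gaps.

    Returns (best_score, best_length) of the extended segment pair.
    """
    # Score of the initial word hit (the seed).
    running = sum(score(query[q_start + i], subject[s_start + i])
                  for i in range(word_len))
    best, best_len = running, word_len
    i = word_len
    while q_start + i < len(query) and s_start + i < len(subject):
        running += score(query[q_start + i], subject[s_start + i])
        if running > best:
            best, best_len = running, i + 1
        elif best - running > x_drop:
            break  # loss since the best point exceeds the drop-off X
        i += 1
    return best, best_len


# Seed "ACD" matches at position 0 in both sequences; two more residues match.
result = extend_right("ACDEFGHIK", "ACDEFWWWW", 0, 0, word_len=3, x_drop=6)
```

Here the seed scores 15, the two following matches raise the score to 25, and the extension stops once two mismatches have cost more than the drop-off of 6, so the reported segment has length 5.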

Aligning protein sequences with the BLOSUM matrix (gap = -5):

FDSK-THRGHR
FDSYWTH-GHR

Score: 6+6+4-2-5+5+8-5+6+8+5 = 36

FDS = 16, GHR = 19

No loss in extension (+1)
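The arithmetic of the alignment example above can be checked with a short sketch. The substitution scores below are only the ones needed for this example, read off the terms of the slide's sum; the dictionary and function names are illustrative, and `-` marks a gap position.

```python
GAP = -5  # gap penalty stated on the slide

# Only the residue pairs that occur in this example, taken from the slide's sum.
SCORES = {
    ("F", "F"): 6, ("D", "D"): 6, ("S", "S"): 4, ("K", "Y"): -2,
    ("T", "T"): 5, ("H", "H"): 8, ("G", "G"): 6, ("R", "R"): 5,
}


def pair_score(a, b):
    """Score one aligned column; a '-' in either row is a gap."""
    if a == "-" or b == "-":
        return GAP
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def alignment_score(top, bottom):
    """Sum the column scores of two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


total = alignment_score("FDSK-THRGHR", "FDSYWTH-GHR")
fds = alignment_score("FDS", "FDS")
ghr = alignment_score("GHR", "GHR")
```

The columns between the two words (K/Y, a gap, T/T, H/H, a gap) sum to +1, which is why extending from FDS to GHR loses nothing.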

Example of Transitive Homology:

Three sequences a, b, and c with direct scores $B_{a,b} = 0.8$, $B_{b,c} = 0.2$, and $B_{a,c} = 0.9$

The homology between sequence b and sequence c can be detected through the third sequence a, and now

$BT_{b,c} = B_{a,b} \times B_{a,c} = 0.72$
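The example above can be reproduced with a small sketch. The slides only give the product for a single intermediate sequence; taking the maximum over several intermediates is an assumption made here for illustration, and all names are hypothetical.

```python
def direct_score(direct, x, y):
    """Look up the symmetric direct score between x and y (0.0 if absent)."""
    return direct.get(frozenset((x, y)), 0.0)


def transitive_score(direct, x, y, sequences):
    """Best product B[x,z] * B[z,y] over intermediate sequences z.

    Using the maximum over intermediates is an assumption of this sketch.
    """
    best = 0.0
    for z in sequences:
        if z in (x, y):
            continue
        best = max(best, direct_score(direct, x, z) * direct_score(direct, z, y))
    return best


# Direct scores from the slide's example.
direct = {
    frozenset(("a", "b")): 0.8,
    frozenset(("b", "c")): 0.2,
    frozenset(("a", "c")): 0.9,
}
bt_bc = transitive_score(direct, "b", "c", ["a", "b", "c"])
```

With these values the transitive score for (b, c) is about 0.72, well above the direct score of 0.2.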

KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for:

1. Metabolism

2. Genetic Information Processing

3. Environmental Information Processing

4. Cellular Processes

5. Human Diseases

and also on the structure relationships (KEGG drug structure maps) in:

6. Drug Development

Protein Similarity through KEGG Pathway Information

210 biological pathways are defined in KEGG, e.g. Metabolism

All protein sequences corresponding to each pathway are downloaded (except yeast proteins)

210 pathway databases are created

A profile for each Yeast protein is computed by searching homologues across 210 pathway databases

Profile dimension is 210

An element of a profile, corresponding to a particular pathway database, is 1 if a homologue is present in that database, otherwise 0

KEGG Pathway (cont.)

Profile similarity between two proteins X and Y is calculated by taking the dot product of their profiles.

Example:

          Metabolism  Cell Cycle  Energy  Transcription
ProfileX  1           0           0       1
ProfileY  1           0           1       1

The dot product between the profiles of the two proteins (ProfileX and ProfileY) is 2.

Dot products between KEGG profiles are denoted as $K_{X,Y}$.
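A minimal sketch of the binary pathway profile and its dot product, using the four example pathways above in place of the full 210. The helper names and the representation of homologue hits as a set of pathway names are illustrative assumptions.

```python
PATHWAYS = ["Metabolism", "Cell Cycle", "Energy", "Transcription"]


def build_profile(hit_pathways, pathways=PATHWAYS):
    """1 if a homologue was found in that pathway database, otherwise 0."""
    return [1 if p in hit_pathways else 0 for p in pathways]


def profile_similarity(profile_x, profile_y):
    """Dot product of two binary pathway profiles."""
    return sum(a * b for a, b in zip(profile_x, profile_y))


profile_x = build_profile({"Metabolism", "Transcription"})
profile_y = build_profile({"Metabolism", "Energy", "Transcription"})
k_xy = profile_similarity(profile_x, profile_y)
```

Because the profiles are binary, the dot product simply counts the pathways in which both proteins have a homologue.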

Protein-protein Interaction Information

Information is downloaded from Database of Interacting Proteins (DIP) containing 320000 Yeast protein-protein interactions

For a given pair of genes/proteins the similarity value is 1 or 0, indicating that an interaction is present or absent, respectively

The similarity value is denoted as $I_{X,Y}$

Problems in Multi-Data Source Integration

1) The degree of ‘biological accuracy’ is different for different data sources.

To obtain equivalency in ‘biological accuracy’, the similarities arising from various data-sources are separately benchmarked, based on the super GO-Slim process annotations of genes in the SGD database. The positive predictive value (PPV) of gene-pairs at a particular similarity value can be used as a benchmarking method.

True positive (TP) pairs are defined as pairs of genes having overlapping GO (Gene Ontology) term classification/annotation.

The positive predictive value of gene pairs is defined as

$$\mathrm{PPV} = \frac{\text{no. of predicted pairs that share a common GO term}}{\text{total no. of predicted pairs}}$$

A gene pair is considered a predicted pair if its similarity value is non-zero and both genes in the pair are classified in SGD

The higher the PPV, the greater the functional similarity between the gene pairs predicted by a similarity measure or method.

PPV can be used as the fitness function.
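The PPV definition above can be sketched in a few lines. The gene identifiers, GO terms, and data layout (a dictionary of pair similarities and a dictionary of annotation sets) are hypothetical; genes missing from the annotation dictionary stand for unclassified genes.

```python
def ppv(similarities, annotations):
    """Fraction of predicted pairs whose genes share at least one GO term.

    A pair counts as predicted if its similarity is non-zero and both
    genes are classified (present in `annotations`).
    """
    predicted = 0
    shared = 0
    for (g1, g2), sim in similarities.items():
        if sim == 0 or g1 not in annotations or g2 not in annotations:
            continue
        predicted += 1
        if annotations[g1] & annotations[g2]:
            shared += 1
    return shared / predicted if predicted else 0.0


# Toy data: g4 is unclassified, and the pair (g2, g3) has zero similarity.
similarities = {("g1", "g2"): 0.9, ("g1", "g3"): 0.4,
                ("g2", "g3"): 0.0, ("g1", "g4"): 0.7}
annotations = {"g1": {"transport"},
               "g2": {"transport", "transcription"},
               "g3": {"cell cycle"}}
value = ppv(similarities, annotations)
```

Only two pairs qualify as predicted here, and one of them shares a term, so the PPV is 0.5.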

Benchmarking the similarity values obtained from different data-sources in terms of their PPV. The PPV values for intermediate similarity values, that are not plotted in the figure, are calculated from the slopes of the respective curves. The similarities extracted from protein-protein interactions are binary relations in our study. Therefore, PPV for protein-protein interactions has a constant value 0.69 at a similarity value of 1 and hence it is not shown in the Figure.

[Figure: proportion of TP (PPV) versus similarity value, with curves for transitive homology, KEGG pathway profile, and microarray.]

Biological Score using Linear Combination

The biological score (BS) between two genes X and Y is computed by integrating the PPV values $PP_{X,Y}$, $PB_{X,Y}$, $PK_{X,Y}$, and $PI_{X,Y}$

The relative contribution/weight of each information source is determined adaptively by maximizing the PPV, dependent on SGD classification of Yeast genes

BS is defined as

$$BS_{X,Y} = \frac{a \times PP_{X,Y}}{a+b+c+d} + \frac{b \times PB_{X,Y}}{a+b+c+d} + \frac{c \times PK_{X,Y}}{a+b+c+d} + \frac{d \times PI_{X,Y}}{a+b+c+d}$$

where a, b, c, and d are varied in steps of 1 to find a combination that maximizes the PPV for classified genes among the top 30000 gene pairs.
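The weighted linear combination and the search over integer weights can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the weight range `max_weight`, the toy gene pairs, and the function names are hypothetical, and each pair carries a flag saying whether its genes share a GO term.

```python
from itertools import product


def biological_score(pp, pb, pk, pi, a, b, c, d):
    """Weighted average of the four PPV values with weights a, b, c, d."""
    return (a * pp + b * pb + c * pk + d * pi) / (a + b + c + d)


def top_pairs_ppv(pairs, weights, top_n):
    """PPV of the top_n pairs ranked by biological score."""
    ranked = sorted(pairs,
                    key=lambda p: biological_score(*p[0], *weights),
                    reverse=True)
    top = ranked[:top_n]
    return sum(1 for _, shares_term in top if shares_term) / len(top)


def best_weights(pairs, top_n, max_weight=3):
    """Try integer weights in steps of 1; max_weight is an assumed bound."""
    best = None
    for w in product(range(max_weight + 1), repeat=4):
        if sum(w) == 0:
            continue  # all-zero weights are undefined
        val = top_pairs_ppv(pairs, w, top_n)
        if best is None or val > best[0]:
            best = (val, w)
    return best


# Each entry: ((PP, PB, PK, PI), genes share a GO term?). Toy values only.
pairs = [((0.9, 0.1, 0.5, 0.0), False), ((0.2, 0.9, 0.5, 0.0), True),
         ((0.8, 0.2, 0.5, 0.0), False), ((0.1, 0.8, 0.5, 0.0), True)]
best = best_weights(pairs, top_n=2)
```

In this toy data only the second source separates the true pairs, so the search settles on weights that use that source alone.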

Non-Linear Score

The Non-Linear Score (NLS) is defined as

$$NLS_{X,Y} = \frac{1}{n} \left( PP_{X,Y}^{\,a} + PB_{X,Y}^{\,b} + PK_{X,Y}^{\,c} + PI_{X,Y}^{\,d} \right)$$

where a, b, c, and d are varied in steps of -1 to find a combination that maximizes the PPV for classified genes, and n is the total number of data sources.

The degree of contribution of each information source is determined by maximizing the PPV, dependent on the SGD classification of yeast genes
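A minimal sketch of a non-linear score, assuming that a, b, c, and d act as exponents on the four PPV values and that the result is averaged over the n data sources. That reading of the formula is an assumption, as are the function name and the example values.

```python
def non_linear_score(ppv_values, exponents):
    """Average of PPV values, each raised to its own exponent (assumed form)."""
    n = len(ppv_values)
    if n == 0 or n != len(exponents):
        raise ValueError("need one exponent per data source")
    return sum(p ** e for p, e in zip(ppv_values, exponents)) / n


# Toy PPV values for the four sources (PP, PB, PK, PI).
values = [0.8, 0.6, 0.4, 0.69]
plain_average = non_linear_score(values, [1, 1, 1, 1])
damped_first = non_linear_score(values, [2, 1, 1, 1])
```

Since PPV values lie between 0 and 1, a larger exponent shrinks a source's contribution and a smaller one inflates it, which is how the exponents would control the degree of contribution under this assumption.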

Evaluation of BS and NLS

As GO (Gene Ontology) classification is used to determine the degrees of contribution, MIPS annotation can be used to evaluate the BS and NLS

Performance can be compared with individual data-sources and related methods by plotting “Total no. of predicted gene-pairs” vs. “positive predictive value” (shown in next slide) by increasing the threshold for individual similarity value.

No. of predicted gene-pairs decreases as threshold increases for any similarity measure

PPV of gene-pairs increases as threshold increases for any similarity measure.
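The trade-off described above (fewer predicted pairs but higher PPV as the threshold rises) can be tabulated with a sketch like the following. The scores and true-pair flags are toy values chosen for illustration, and the function name is hypothetical.

```python
def ppv_curve(scored_pairs, thresholds):
    """For each threshold, return (threshold, number of pairs kept, PPV).

    scored_pairs holds (similarity, shares_functional_annotation) tuples.
    PPV is None when no pair passes the threshold.
    """
    curve = []
    for t in thresholds:
        kept = [shared for score, shared in scored_pairs if score >= t]
        n = len(kept)
        curve.append((t, n, sum(kept) / n if n else None))
    return curve


# Toy data: higher-scoring pairs are more often true pairs.
scored_pairs = [(0.95, True), (0.9, True), (0.7, True), (0.6, False),
                (0.4, True), (0.3, False), (0.2, False)]
curve = ppv_curve(scored_pairs, [0.2, 0.5, 0.8])
```

Plotting the second column against the third for each similarity measure gives the kind of comparison curve the slides describe.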

[Figure: PPV versus number of top relations (x 10^4) for Transitive homology, Lee et al. Prob. Network with same source, Lee et al. Prob. Network, Microarray, Phenotypic Profile, Non-Linear Score, Linear Combination Score, and KEGG Pathway profiles.]

Influence of number of classified genes on Positive Predictive Value

Top Predictions

Gene functions are predicted from the first 10 neighbors, using NLS

The top function predictions consist of the functions of 12 unclassified genes and 417 classified genes, with 98.2% PPV.

The prediction is performed with the MIPS 2008 classification and validated with 2011 classification.
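The neighbour-based prediction step described above can be sketched as follows. The slides do not say how a function is chosen from the first 10 neighbours, so the majority vote over classified neighbours used here is an assumption, and the gene names, scores, and categories are hypothetical.

```python
from collections import Counter


def predict_function(gene, scores, annotations, k=10):
    """Vote among the k highest-scoring classified neighbours of `gene`.

    Majority voting is an assumption of this sketch. Returns None if no
    classified neighbour exists.
    """
    candidates = [(s, other) for other, s in scores[gene].items()
                  if other in annotations]
    neighbours = sorted(candidates, reverse=True)[:k]
    votes = Counter()
    for _, other in neighbours:
        votes.update(annotations[other])
    return votes.most_common(1)[0][0] if votes else None


# Toy similarity scores of two unclassified genes to other genes.
scores = {"u1": {"g1": 0.9, "g2": 0.8, "g3": 0.3, "g4": 0.7},
          "u2": {"g4": 0.5}}
annotations = {"g1": {"transport"}, "g2": {"transport"},
               "g3": {"cell cycle"}}
prediction = predict_function("u1", scores, annotations)
```

Here `u1` is assigned the category held by two of its three classified neighbours, while `u2` has no classified neighbour and receives no prediction.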

Top 12 function predictions for unclassified genes

Out of these 12 unclassified genes, YIL080W, YHR218W, YHR219W, and YHR049W are now classified in MIPS, and our predictions are in agreement with MIPS.

Summary

• Frameworks for multiple data-source integration, which combine pairwise similarities from different sources, are presented.

• Functional categories of 12 unclassified yeast genes are predicted.

• Evaluation on 12 unclassified genes, by the Saccharomyces Genome Database (SGD), confirmed the validity and potential value of the framework for gene function prediction.

Selected References

1. S. S. Ray, S. Bandyopadhyay and S. K. Pal, “A Weighted Power Framework for Integrating Multi-Source Information: Gene Function Prediction in Yeast”, IEEE Transactions on Biomedical Engineering, vol. preprint, no. 00, pp. 1-7, 2012.

2. S. S. Ray, S. Bandyopadhyay and S. K. Pal, “Combining Multi-Source Information through Functional Annotation based Weighting: Gene Function Prediction in Yeast”, IEEE Transactions on Biomedical Engineering, vol. 56, no. 2, pp. 229-236, 2009.

3. I. Lee, S. V. Date, A. T. Adai, and E. M. Marcotte, “A probabilistic functional network of yeast genes,” Science, vol. 306, pp. 1555–1558, 2004.

4. C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, “Comparative assessment of large-scale data sets of protein-protein interactions,” Nature, vol. 417, pp. 399–403, 2002.

5. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg, “A combined algorithm for genome-wide prediction of protein function,” Nature, vol. 402, pp. 83–86, 1999.

6. E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg, “Detecting protein function and protein-protein interactions from genome sequences,” Science, vol. 285, pp. 751–753, 1999.

7. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. National Academy of Sciences, vol. 95, pp. 14863-14867, 1998.

8. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, “From genomics to chemical genomics: new developments in KEGG,” Nucleic Acids Res., vol. 34, pp. D354–D357, 2006.

9. L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, “The database of interacting proteins,” Nucleic Acids Research, vol. 32, pp. D449–D451, 2004.

10. O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, “A bayesian framework for combining heterogeneous data sources for gene function prediction (in saccharomyces cerevisiae),” Proc. Natl. Acad. Sci. USA, vol. 100, no. 14, pp. 8348–8353, 2003.

11. Q. Ma, G. W. Chirn, R. Cai, J. D. Szustakowski, and N. Nirmala, “Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks,” BMC Bioinformatics, vol. 6, no. 242, 2005.

12. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped blast and psi-blast: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.

13. P. Pipenbacher, A. Schliep, S. Schneckener, A. Schonhuth, D. Schomburg, and R. Schrader, “Proclust: improved clustering of protein sequences with an extended graph-based approach,” Bioinformatics, vol. 18, no. 2, pp. S182–S191, 2002.

Thank You

Handling Gene Expressions

Shubhra Sankar Ray

[email protected]

Center for Soft Computing Research,

Indian Statistical Institute, Kolkata, India

Gene expression

Process by which a gene's coded information is converted into the structures present and operating in the cell.

Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).

Not all genes are expressed, and gene expression analysis involves studying the expression levels of genes in cells under different conditions.

Measuring Gene Expression with Microarray

A microarray enables measuring the expression levels of thousands of genes at the same time.

It is typically a glass slide on which DNA molecules are attached at fixed locations and colored with the green-fluorescent dye Cy3.

There may be tens of thousands of spots on an array, each containing about 10^7 to 10^8 identical DNA molecules.

For gene expression studies, each of these molecules ideally should identify one gene or one exon in the genome

The spots are either printed on the microarrays by a robot, or synthesized by photo-lithography or by ink-jet printing.

In the biological experiment, mRNA from the experimental samples is colored with the red-fluorescent dye Cy5 during reverse transcription.

The Cy5/Cy3 fluorescence ratio (the gene expression value) is obtained from the microarray by measuring the spot intensities with a fluorescence scanner.

Many unanswered, and important, questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting microarray data.

A gene expression database can be regarded as consisting of three parts – the gene expression data matrix, gene annotation and sample annotation.

Figure: Gene expression array

Microarray gene expression values

gene     Cell Cycle 1  Cell Cycle 2  Sporulation 1  Sporulation 2  Shock 1  Shock 2  Diauxic Shift 1
YDR029W  0.05          -0.21         0.16           0.19           0        2.19     -2.0
YBL052C  -0.38         -0.71         0.14           0.33           0.01     0.09     1.7
YOR337W  -0.42         1.78          0              0.62           0.64     0.45     2.92
YMR183C  -0.86         -0.67         -3.23          0.09           -0.09    0.48     0.49
YKR021W  -0.85         1.31          -0.46          3.42           -2.38    0.19     0.64
YHR023W  -1.77         -0.86         -1.07          0.1            0.28     0.91     0.97
YHR029C  -0.58         -0.91         -0.62          -0.13          0.06     -0.08    2.03

Dynamic range of PMT: -1.2 to 1.2, -3.0 to 3.0, -1.5 to 1.5, -2.0 to 2.0

Average-linkage hierarchical clustering was one of the first clustering algorithms applied to microarray data. Using a distance metric, the method builds a hierarchical binary tree (called a dendrogram). Given a set of N data points to be clustered and an N × N distance (or similarity) matrix, the basic steps of hierarchical clustering are:

S1) Start by assigning each item to a cluster, so that if there are N items there are N clusters, each containing just one item. So, the distances (similarities) between the clusters are the same as the distances (similarities) between the items they contain.

S2) Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is one fewer cluster.

S3) Compute distances (similarities) between the new cluster and each of the old clusters.

S4) Repeat S2 and S3 until all items are clustered into a single cluster of size N.
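Steps S1 to S4 can be sketched in pure Python as below, using average linkage on a precomputed distance matrix. This is a naive illustration rather than an efficient implementation: the function name and the toy points are hypothetical, and S3 is carried out implicitly by recomputing cluster distances from the member items in every round.

```python
def average_linkage_clustering(dist):
    """Agglomerative clustering on a symmetric N x N distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(dist))]  # S1: one item per cluster
    merges = []

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:  # S4: repeat until one cluster remains
        best = None
        for p in range(len(clusters)):  # S2: find the closest pair
            for q in range(p + 1, len(clusters)):
                d = average_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        # S3: distances to the new cluster are recomputed in the next round.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return merges


# Four items on a line; distance is the absolute difference of positions.
points = [0, 1, 4, 6]
dist = [[abs(a - b) for b in points] for a in points]
merges = average_linkage_clustering(dist)
```

The order and heights of the merges are exactly the information a dendrogram displays: items 0 and 1 join first, then items 2 and 3, and finally the two pairs.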

Hierarchical Clustering

1. Single Linkage

2. Average Linkage

3. Complete Linkage

In single-linkage clustering (also called the connectedness or minimum method), the shortest distance from any member of one cluster to any member of the other cluster is considered as the distance between one cluster and another cluster.

In complete-linkage clustering (also called the diameter or maximum method), the largest distance from any member of one cluster to any member of the other cluster is considered as the distance between one cluster and another cluster.

In average-linkage clustering, the average distance from any member of one cluster to any member of the other cluster is considered as the distance between one cluster and another cluster.
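The three inter-cluster distances can be compared side by side with a short sketch. The function names, the toy clusters, and the absolute-difference item distance are illustrative assumptions.

```python
def pair_distances(cluster_a, cluster_b, dist):
    """All item-to-item distances between two clusters."""
    return [dist(a, b) for a in cluster_a for b in cluster_b]


def single_linkage(cluster_a, cluster_b, dist):
    return min(pair_distances(cluster_a, cluster_b, dist))  # shortest


def complete_linkage(cluster_a, cluster_b, dist):
    return max(pair_distances(cluster_a, cluster_b, dist))  # largest


def average_linkage(cluster_a, cluster_b, dist):
    d = pair_distances(cluster_a, cluster_b, dist)
    return sum(d) / len(d)  # mean over all cross-cluster pairs


def abs_diff(a, b):
    return abs(a - b)


# Two toy clusters of points on a line.
cluster_a = [0, 1]
cluster_b = [4, 6]
single = single_linkage(cluster_a, cluster_b, abs_diff)
complete = complete_linkage(cluster_a, cluster_b, abs_diff)
average = average_linkage(cluster_a, cluster_b, abs_diff)
```

For these clusters the cross-cluster distances are 4, 6, 3, and 5, so the single, complete, and average linkage distances are 3, 6, and 4.5 respectively.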

Gene Ordering in Clustering Solutions

Subclusters found by Gene Ordering

3 Cell Cycle and DNA Processing

4 Transcription

6 Protein Fate (folding, modification, destination)

7 Protein with Binding Function or Cofactor Requirement

9 Cellular Transport, Transport Facilitation and Transport Routes

Gene ordering helps to identify the number and size of subclusters

Relations among genes within a cluster can be identified with gene ordering

Figure 2: Comparing Average Linkage (Fig. a), Average Linkage + Leaf Ordering (Fig. b), K-means (Fig. c), and K-means + Gene Ordering (Fig. d) for Herpes data (106 x 21)
