Predicting Gene Functions
Shubhra Sankar Ray
Center for Soft Computing Research,
Indian Statistical Institute, Kolkata, India
Tasks in Bioinformatics:
Alignment, comparison and analysis of DNA, RNA, and protein sequences
Interpretation and analysis of microarray gene expression data
Gene mapping on chromosomes
Gene finding and promoter identification from DNA sequences
Predicting gene regulatory networks
Construction of phylogenetic trees for studying evolutionary relationships
Structure prediction and classification of DNA, RNA and protein
Molecular and ligand design with molecular docking
Predicting function of unclassified genes
Need for Pattern Recognition and Structure Prediction Algorithms to Understand the Meaning of Data
Motivation
Tasks involved in Gene Function Prediction
Basics of Data Sources
Related Problems
Combining Multi-Source information
Validation
Gene Function Prediction
Summary
References
Motivation
One of the important goals of biological investigation is to predict the function of unclassified genes.
One approach is to identify the nearest classified genes using different data sources, such as microarray gene expression, protein sequences, protein-protein interaction data, and pathway information from the Kyoto Encyclopedia of Genes and Genomes (KEGG).
Even in a model organism like Yeast, more than 800 genes have unknown biological function in the Munich Information Center for Protein Sequences (MIPS) and the Saccharomyces Genome Database (SGD).
A single data set can assess functional relationships between genes and can assign functional annotations to a significant number of unclassified genes, but a single source alone often lacks the degree of specificity needed for accurate gene function prediction.
Specificity can be improved by combining multiple data sets in an integrated analysis.
Tasks Involved in Multi-Source Information Integration for Gene Function Prediction
Choosing informative data sources
Extracting similarities/scores from individual sources
Benchmarking data-sources in common framework
Finding a method for integrating different similarities/scores
Finding highly similar gene-pairs
Predicting networks/clusters of highly similar genes
Predicting function of unclassified genes
Data Sources
Different types of data source that can be used for gene function prediction are:
Microarray Gene Expression
Protein similarity through transitive homology
Protein-protein interaction information
KEGG Pathway information
According to MIPS, the accuracies of all these data sources are above 50%.
Phylogenetic profiles, Rosetta Stone linkages, and Medline abstracts are avoided because of their low accuracies.
Measuring Gene Expression with Microarray
In a biological experiment, mRNA from experimental samples is labeled during reverse transcription with the red-fluorescent dye Cy5.
The Cy5/Cy3 fluorescence ratio (gene expression) is obtained from the microarray by measuring the spot intensities with a fluorescence scanner.
Many important unanswered questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting microarray data.
gene      Cell Cycle 1  Cell Cycle 2  Sporulation 1  Sporulation 2  Shock 1  Shock 2  Diauxic Shift 1
YDR029W   0.05          -0.21         0.16           0.19           0        2.19     -2.0
YBL052C   -0.38         -0.71         0.14           0.33           0.01     0.09     1.7
YOR337W   -0.42         1.78          0              0.62           0.64     0.45     2.92
YMR183C   -0.86         -0.67         -3.23          0.09           -0.09    0.48     0.49
YKR021W   -0.85         1.31          -0.46          3.42           -2.38    0.19     0.64
YHR023W   -1.77         -0.86         -1.07          0.1            0.28     0.91     0.97
YHR029C   -0.58         -0.91         -0.62          -0.13          0.06     -0.08    2.03
Microarray gene expression values
Dynamic range of PMT: -1.2 to 1.2, -3.0 to 3.0, -1.5 to 1.5, -2.0 to 2.0
Similarity Extraction through Gene Expression
Gene similarity is extracted using Pearson correlation and defined as
$$P_{X,Y} = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{x_i - \bar{X}}{\sigma_x} \right) \left( \frac{y_i - \bar{Y}}{\sigma_y} \right)$$
where $x_i$ and $y_i$ are the expression values of genes X and Y at the $i$th time point, respectively, and $\bar{X}$ and $\sigma_x$ are the mean and standard deviation of the expression profile of gene X ($\bar{Y}$ and $\sigma_y$ likewise for gene Y).
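The Pearson similarity between two expression profiles can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name is hypothetical, the standard deviation is taken over k (to match the 1/k factor), and returning 0 for a flat profile is an assumption made here to avoid division by zero.

```python
import math

def pearson_similarity(x, y):
    """Pearson correlation between two expression profiles of equal length k."""
    k = len(x)
    if k == 0 or k != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    # Population standard deviation (divide by k), consistent with the 1/k factor.
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / k)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / k)
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / k

# Two rows from the expression table shown earlier.
ydr029w = [0.05, -0.21, 0.16, 0.19, 0.0, 2.19, -2.0]
ybl052c = [-0.38, -0.71, 0.14, 0.33, 0.01, 0.09, 1.7]
print(pearson_similarity(ydr029w, ybl052c))
```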
Protein Sequence Similarity extraction through Transitive Homologues
6221 Yeast protein sequences, corresponding to 6221 ORFs/genes, are downloaded from SGD and compared, using BLAST, with 3,357,450 protein sequences downloaded from the UniProt database
Proteins related by direct homology are all classified
Transitive homology can identify new relations through the 3,357,450 sequences
The BLAST score ($B_{X,Y}$) is used as the similarity between two protein sequences X and Y.
How BLAST (Basic Local Alignment Search Tool) Works
Given a query sequence, look for high-scoring words of length W in the database
Compile list L of all words that score >T
When some word found: Extend alignment
When score drops below X stop extension
Report all words with large score S
W: Word size – minimum number of aligned amino acids (3)
T: Threshold – focus on pairs scoring >T (11)
X: Drop-off – stop extending when loss >X (15)
S: Score – the final score of segment pair
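The extension step with a drop-off X can be illustrated with a small sketch. It is a simplification under stated assumptions: it extends only to the right and without gaps, and the match/mismatch scores (+5/-4) are hypothetical stand-ins for a real substitution matrix. The function names are illustrative and do not correspond to any BLAST API.

```python
def simple_score(p, q):
    """Hypothetical pair score: +5 for a match, -4 for a mismatch."""
    return 5 if p == q else -4

def extend_right(query, subject, q_start, s_start, word_len, x_drop,
                 score_pair=simple_score):
    """Ungapped rightward extension of a seed word hit.

    Extension stops once the running score has fallen more than x_drop
    below the best score seen so far. Returns (best_score, best_length).
    """
    # Score of the seed word itself.
    score = sum(score_pair(query[q_start + i], subject[s_start + i])
                for i in range(word_len))
    best_score, best_len = score, word_len
    i = word_len
    while q_start + i < len(query) and s_start + i < len(subject):
        score += score_pair(query[q_start + i], subject[s_start + i])
        if score > best_score:
            best_score, best_len = score, i + 1
        elif best_score - score > x_drop:
            break  # loss exceeds the drop-off X: stop extending
        i += 1
    return best_score, best_len

print(extend_right("FDSKT", "FDSKT", 0, 0, word_len=3, x_drop=7))
```

A full implementation would also extend to the left of the seed and keep only segment pairs whose final score S is large enough to report.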
There are twenty types of amino acids; each pair of amino acids has a similarity score, which varies from pair to pair
Aligning protein sequences with the BLOSUM matrix (gap = -5):
FDSK-THRGHR
FDSYWTH-GHR
Score: 6+6+4-2-5+5+8-5+6+8+5 = 36
FDS = 16, GHR = 19
No loss in extension (+1)
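The alignment score in this example can be reproduced with a short sketch. The pair scores below are only the ones that appear in the sum on the slide (they agree with BLOSUM62, but the slide does not say which BLOSUM matrix was used), and the position of the two gaps is inferred from the spacing of the sequences; both are assumptions.

```python
GAP = -5  # gap penalty stated on the slide

# Only the residue pairs needed for this example, taken from the slide's sum.
PAIR_SCORES = {
    ("F", "F"): 6, ("D", "D"): 6, ("S", "S"): 4, ("K", "Y"): -2,
    ("T", "T"): 5, ("H", "H"): 8, ("G", "G"): 6, ("R", "R"): 5,
}

def pair_score(p, q):
    """Look up a symmetric pair score; fail loudly for pairs not in the table."""
    if (p, q) in PAIR_SCORES:
        return PAIR_SCORES[(p, q)]
    if (q, p) in PAIR_SCORES:
        return PAIR_SCORES[(q, p)]
    raise KeyError(f"no score for pair {p}/{q} in this reduced table")

def alignment_score(a, b, gap_char="-"):
    """Sum pair scores column by column; any column with a gap costs GAP."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for p, q in zip(a, b):
        total += GAP if gap_char in (p, q) else pair_score(p, q)
    return total

print(alignment_score("FDSK-THRGHR", "FDSYWTH-GHR"))
```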
Example of Transitive Homology:
Three sequences a, b, and c have direct scores $B_{a,b} = 0.8$, $B_{b,c} = 0.2$, and $B_{a,c} = 0.9$.
The homology between sequence b and sequence c can be detected through the third sequence a, and now
$BT_{b,c} = B_{a,b} \times B_{a,c} = 0.72$
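A small sketch of the transitive score for this example follows. The slide shows a single intermediate sequence; taking the best product over all candidate intermediates, as done here, is an assumption, and the function names are hypothetical.

```python
def direct_score(direct, u, v):
    """Symmetric lookup of a direct (normalized) BLAST score; 0 if absent."""
    return direct.get(frozenset((u, v)), 0.0)

def transitive_score(direct, x, y, sequences):
    """Best product of two direct scores through one intermediate sequence."""
    candidates = [direct_score(direct, x, m) * direct_score(direct, m, y)
                  for m in sequences if m not in (x, y)]
    return max(candidates, default=0.0)

# Scores from the example: B(a,b) = 0.8, B(b,c) = 0.2, B(a,c) = 0.9.
direct = {
    frozenset(("a", "b")): 0.8,
    frozenset(("b", "c")): 0.2,
    frozenset(("a", "c")): 0.9,
}
bt_bc = transitive_score(direct, "b", "c", ["a", "b", "c"])
print(round(bt_bc, 2))
```

Here the transitive score through a exceeds the direct score of 0.2 between b and c, which is the point of the example.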
KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for:
1. Metabolism 2. Genetic Information Processing 3. Environmental Information Processing 4. Cellular Processes 5. Human Diseases
and also on the structure relationships (KEGG drug structure maps) in:
6. Drug Development
Protein Similarity through KEGG Pathway Information
210 biological pathways are defined in KEGG, e.g. Metabolism
All protein sequences corresponding to each pathway are downloaded (except yeast proteins)
210 pathway databases are created
A profile for each Yeast protein is computed by searching homologues across 210 pathway databases
Profile dimension is 210
An element of a profile, corresponding to a particular pathway database, is 1 if a homologue is present in that database, otherwise 0
KEGG Pathway (cont.)
Profile similarity between two proteins X and Y is calculated by taking the dot product of their profiles.
Example:
           Metabolism  Cell Cycle  Energy  Transcription
Profile X  1           0           0       1
Profile Y  1           0           1       1
The dot product between the profiles of proteins X and Y is 2.
Dot products between KEGG profiles are denoted as $K_{X,Y}$.
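The profile dot product from this example can be written directly. The sketch uses the four pathways of the example rather than the full 210-dimensional profile; the function name is illustrative.

```python
PATHWAYS = ["Metabolism", "Cell Cycle", "Energy", "Transcription"]

def profile_similarity(profile_x, profile_y):
    """Dot product of two binary pathway profiles of equal dimension."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must have the same dimension")
    return sum(a * b for a, b in zip(profile_x, profile_y))

profile_x = [1, 0, 0, 1]  # homologues found in Metabolism and Transcription
profile_y = [1, 0, 1, 1]  # homologues found in Metabolism, Energy, Transcription
print(profile_similarity(profile_x, profile_y))
```

For binary profiles the dot product simply counts the pathway databases in which both proteins have a homologue.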
Protein-protein Interaction Information
Information is downloaded from the Database of Interacting Proteins (DIP), containing 320000 Yeast protein-protein interactions
For a given pair of genes/proteins, the similarity value is 1 or 0, indicating that an interaction is present or absent, respectively
Similarity value is denoted as I
Problems in Multi-Data Source Integration
1) The degree of ‘biological accuracy’ is different for different data sources.
To obtain equivalency in ‘biological accuracy’, the similarities arising from various data-sources are separately benchmarked, based on the super GO-Slim process annotations of genes in the SGD database. The positive predictive value (PPV) of gene-pairs at a particular similarity value can be used as a benchmarking method.
TP pairs are defined as pairs of genes having overlapping GO (Gene Ontology) term classification/annotation.
Positive Predictive Value of gene-pairs is defined as
PPV = (no. of predicted pairs that share a common GO term) / (total no. of predicted pairs)
A gene pair is considered a predicted pair if its similarity value is non-zero and both genes in the pair are classified in SGD
The higher the PPV, the greater the functional similarity between the gene-pairs predicted by a similarity measure or method.
PPV can be used as the fitness function.
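The PPV definition can be sketched as follows. The data structures and gene names (g1 to g4) are hypothetical: similarities are stored per gene pair, and only classified genes appear in the annotation dictionary.

```python
def ppv(similarities, go_terms):
    """Positive predictive value of predicted gene pairs.

    similarities: dict mapping (gene_x, gene_y) -> similarity value
    go_terms:     dict mapping classified gene -> set of GO terms
    A pair is 'predicted' if its similarity is non-zero and both genes
    are classified; it is a true positive if the GO term sets overlap.
    """
    predicted = 0
    true_positive = 0
    for (gx, gy), sim in similarities.items():
        if sim == 0 or gx not in go_terms or gy not in go_terms:
            continue
        predicted += 1
        if go_terms[gx] & go_terms[gy]:
            true_positive += 1
    return true_positive / predicted if predicted else 0.0

# Hypothetical toy data: g4 is unclassified, so its pair is not counted.
similarities = {("g1", "g2"): 0.9, ("g1", "g3"): 0.4,
                ("g2", "g3"): 0.0, ("g1", "g4"): 0.7}
go_terms = {"g1": {"A", "B"}, "g2": {"B"}, "g3": {"C"}}
print(ppv(similarities, go_terms))
```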
Benchmarking the similarity values obtained from different data-sources in terms of their PPV. The PPV values for intermediate similarity values, that are not plotted in the figure, are calculated from the slopes of the respective curves. The similarities extracted from protein-protein interactions are binary relations in our study. Therefore, PPV for protein-protein interactions has a constant value 0.69 at a similarity value of 1 and hence it is not shown in the Figure.
[Figure: proportion of TP (PPV) versus similarity value, both on a 0 to 1 scale, for transitive homology, KEGG pathway profile, and microarray.]
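Obtaining the PPV at an intermediate similarity value from the slope between two plotted points amounts to linear interpolation. The sketch below shows that idea; the benchmark points are made-up numbers for illustration, not values read from the figure, and the function name is hypothetical.

```python
from bisect import bisect_right

def interpolate_ppv(points, similarity):
    """Linearly interpolate PPV between benchmarked (similarity, ppv) points.

    points must be sorted by similarity; values outside the benchmarked
    range are clamped to the nearest end point (an assumption).
    """
    xs = [p[0] for p in points]
    if similarity <= xs[0]:
        return points[0][1]
    if similarity >= xs[-1]:
        return points[-1][1]
    j = bisect_right(xs, similarity)
    (x0, y0), (x1, y1) = points[j - 1], points[j]
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (similarity - x0)

# Hypothetical benchmark curve for one data source.
curve = [(0.0, 0.2), (0.5, 0.4), (1.0, 0.9)]
print(interpolate_ppv(curve, 0.75))
```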
Biological Score using Linear Combination
The BS of two genes X and Y is computed by integrating the PPV values $PP_{X,Y}$, $PB_{X,Y}$, $PK_{X,Y}$, and $PI_{X,Y}$
The relative contribution/weight of each information source is determined adaptively by maximizing the PPV, dependent on SGD classification of Yeast genes
BS is defined as
$$BS_{X,Y} = PP_{X,Y} \times \frac{a}{a+b+c+d} + PB_{X,Y} \times \frac{b}{a+b+c+d} + PK_{X,Y} \times \frac{c}{a+b+c+d} + PI_{X,Y} \times \frac{d}{a+b+c+d}$$
where a, b, c, and d are varied in steps of 1 to find a combination that maximizes the PPV for classified genes among the top 30000 gene pairs.
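The linear combination and the search over integer weights can be sketched as below. This is an illustration under assumptions: the upper bound on the weights (`max_weight`) is not given in the source, the data structures are hypothetical, and an exhaustive grid search stands in for whatever search the authors used.

```python
from itertools import product

def biological_score(pp, pb, pk, pi, a, b, c, d):
    """Weighted average of the four PPV values with weights a, b, c, d."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return (a * pp + b * pb + c * pk + d * pi) / total

def best_weights(ppv_values, shares_go_term, top_n, max_weight=3):
    """Grid search over integer weights (steps of 1).

    ppv_values:     dict pair -> (PP, PB, PK, PI) for classified gene pairs
    shares_go_term: dict pair -> True if the pair shares a GO term
    Returns (best PPV among the top_n ranked pairs, weight tuple).
    """
    best = (-1.0, None)
    for w in product(range(max_weight + 1), repeat=4):
        if sum(w) == 0:
            continue
        ranked = sorted(ppv_values,
                        key=lambda pair: biological_score(*ppv_values[pair], *w),
                        reverse=True)
        top = ranked[:top_n]
        score = sum(shares_go_term[p] for p in top) / len(top)
        if score > best[0]:
            best = (score, w)
    return best

# Hypothetical toy data: only the second source separates the two pairs correctly.
toy_values = {"p1": (0.9, 0.1, 0.0, 0.0), "p2": (0.1, 0.9, 0.0, 0.0)}
toy_truth = {"p1": False, "p2": True}
print(best_weights(toy_values, toy_truth, top_n=1))
```

In the setting described above, `top_n` would be 30000 and the pairs would be restricted to classified genes.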
Non-Linear Score
Non-Linear Score is defined as
$$NLS_{X,Y} = \frac{1}{n} \left( PP_{X,Y}^{a} + PB_{X,Y}^{b} + PK_{X,Y}^{c} + PI_{X,Y}^{d} \right)$$
where a, b, c, and d are varied in steps of -1 to find a combination that maximizes the PPV for classified genes, and n is the total number of data sources.
The degree of contribution of each information source is determined by maximizing the PPV, dependent on SGD classification of Yeast genes
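A minimal sketch of the non-linear score follows, assuming it is the mean of the four PPV values with each value raised to its own exponent (a reading suggested by the "weighted power" title of reference 1, not confirmed by the slide itself). The function name and the example numbers are hypothetical.

```python
def non_linear_score(ppv_values, exponents):
    """Mean of PPV values, each raised to its own exponent.

    ppv_values: sequence (PP, PB, PK, PI) for one gene pair
    exponents:  sequence (a, b, c, d), one per data source
    Note: a PPV of 0 with a negative exponent raises ZeroDivisionError
    in Python; how such cases should be handled is not specified here.
    """
    if len(ppv_values) != len(exponents):
        raise ValueError("one exponent is needed per data source")
    n = len(ppv_values)
    return sum(p ** e for p, e in zip(ppv_values, exponents)) / n

# Hypothetical PPV values and exponents for one gene pair.
print(non_linear_score((0.5, 0.25, 1.0, 0.0), (1, 2, 3, 1)))
```

Because PPV values lie between 0 and 1, a larger exponent shrinks a source's contribution, so the exponents act as per-source weights.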
Evaluation of BS and NLS
As GO (Gene Ontology) classification is used to determine the degrees of contribution, MIPS annotation can be used to evaluate the BS and NLS
Performance can be compared with individual data-sources and related methods by plotting “Total no. of predicted gene-pairs” vs. “positive predictive value” (shown in next slide) by increasing the threshold for individual similarity value.
No. of predicted gene-pairs decreases as threshold increases for any similarity measure
PPV of gene-pairs increases as threshold increases for any similarity measure.
[Figure: PPV versus number of top relations (0 to 9 × 10^4). Curves: transitive homology, Lee et al. probabilistic network, Lee et al. probabilistic network with same source, microarray, phenotypic profile, KEGG pathway profiles, Linear Combination Score, and Non-Linear Score.]
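The evaluation curve (number of predicted gene pairs against PPV as the threshold rises) can be computed with a simple threshold sweep. The sketch below uses hypothetical scores and labels; in the evaluation described above the label would come from MIPS annotation.

```python
def ppv_curve(scored_pairs, thresholds):
    """For each threshold, count pairs scoring at or above it and their PPV.

    scored_pairs: list of (score, shares_annotation) with a boolean label
    Returns a list of (threshold, number_of_pairs, ppv); thresholds that
    keep no pairs are skipped.
    """
    curve = []
    for t in thresholds:
        kept = [label for score, label in scored_pairs if score >= t]
        if kept:
            curve.append((t, len(kept), sum(kept) / len(kept)))
    return curve

# Hypothetical scored pairs for one similarity measure.
scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
for threshold, count, value in ppv_curve(scored, [0.0, 0.5, 0.85]):
    print(threshold, count, round(value, 3))
```

In this toy example the count falls and the PPV rises as the threshold increases, matching the two trends stated above.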
Influence of number of classified genes on Positive Predictive Value
Top Predictions
Gene functions are predicted from the first 10 neighbors, using NLS
The top function predictions consist of the functions of 12 unclassified genes and 417 classified genes, with 98.2% PPV.
The prediction is performed with the MIPS 2008 classification and validated with 2011 classification.
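Prediction from the nearest neighbors can be sketched as a vote among the top-scoring classified genes. The slide only states that the first 10 neighbors are used; the majority vote, the function name, and the toy data below are assumptions for illustration.

```python
from collections import Counter

def predict_function(neighbor_scores, annotations, k=10):
    """Predict a functional category from the k highest-scoring neighbors.

    neighbor_scores: dict mapping neighbor gene -> score with the query gene
    annotations:     dict mapping classified gene -> iterable of categories
    Returns the most frequent category among the top-k neighbors, or None
    if none of them is classified.
    """
    ranked = sorted(neighbor_scores, key=neighbor_scores.get, reverse=True)[:k]
    votes = Counter()
    for gene in ranked:
        for category in annotations.get(gene, ()):
            votes[category] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical neighbors of one unclassified gene.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.1}
categories = {"g1": ["Transcription"], "g2": ["Transcription"],
              "g3": ["Cell Cycle"], "g4": ["Cell Cycle"]}
print(predict_function(scores, categories, k=3))
```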
Top 12 function predictions for unclassified genes
Out of these 12 unclassified genes, YIL080W, YHR218W, YHR219W, and YHR049W are now classified in MIPS, and our predictions are in agreement with MIPS.
Summary
•Frameworks for multiple data-source integration that combine pairwise similarities from different sources are presented.
•Functional categories of 12 unclassified Yeast genes are predicted.
•Evaluation on 12 unclassified genes, by Saccharomyces Genome Database (SGD), confirmed the validity and potential value of the framework for gene function prediction.
Selected References
1. S. S. Ray, S. Bandyopadhyay and S. K. Pal, “A Weighted Power Framework for Integrating Multi-Source Information: Gene Function Prediction in Yeast”, IEEE Transactions on Biomedical Engineering, vol. preprint, no. 00, pp. 1-7, 2012.
2. S. S. Ray, S. Bandyopadhyay and S. K. Pal, “Combining Multi-Source Information through Functional Annotation based Weighting: Gene Function Prediction in Yeast”, IEEE Transactions on Biomedical Engineering, vol. 56, no. 2, pp. 229-236, 2009.
3. I. Lee, S. V. Date, A. T. Adai, and E. M. Marcotte, “A probabilistic functional network of yeast genes,” Science, vol. 306, pp. 1555–1558, 2004.
4. C. V. Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, “Comparative assessment of large-scale data sets of protein-protein interactions,” Nature, vol. 417, pp. 399–403, 2002.
5. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg, “A combined algorithm for genome-wide prediction of protein function,” Nature, vol. 402, pp. 83–86, 1999.
6. E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg, “Detecting protein function and protein-protein interactions from genome sequences,” Science, vol. 285, pp. 751–753, 1999.
7. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. National Academy of Sciences, vol. 95, pp. 14863-14867, 1998.
8. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, “From genomics to chemical genomics: new developments in KEGG,” Nucleic Acids Res., vol. 34, pp. D354–D357, 2006.
9. L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, “The database of interacting proteins,” Nucleic Acids Research, vol. 32, pp. D449–D451, 2004.
10. O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, “A bayesian framework for combining heterogeneous data sources for gene function prediction (in saccharomyces cerevisiae),” Proc. Natl. Acad. Sci. USA, vol. 100, no. 14, pp. 8348–8353, 2003.
11. Q. Ma, G. W. Chirn, R. Cai, J. D. Szustakowski, and N. Nirmala, “Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks,” BMC Bioinformatics, vol. 6, no. 242, 2005.
12. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped blast and psi-blast: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
13. P. Pipenbacher, A. Schliep, S. Schneckener, A. Schönhuth, D. Schomburg, and R. Schrader, “Proclust: improved clustering of protein sequences with an extended graph-based approach,” Bioinformatics, vol. 18, no. 2, pp. S182–S191, 2002.
Thank You
Handling Gene Expressions
Shubhra Sankar Ray
Center for Soft Computing Research,
Indian Statistical Institute, Kolkata, India
Gene expression
Process by which a gene's coded information is converted into the structures present and operating in the cell.
Expressed genes include those that are transcribed into mRNA and then translated into protein, and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).
Not all genes are expressed and gene expression involves the study of the expression level of genes in the cells under different conditions.
Measuring Gene Expression with Microarray
A microarray enables measuring the expression levels of thousands of genes at the same time.
It is typically a glass slide on which DNA molecules are attached at fixed locations and labeled with the green-fluorescent dye Cy3.
There may be tens of thousands of spots on an array, each containing about 10^7-10^8 identical DNA molecules.
For gene expression studies, each of these molecules ideally should identify one gene or one exon in the genome.
The spots are either printed on the microarrays by a robot, or synthesized by photolithography or by ink-jet printing.
In a biological experiment, mRNA from experimental samples is labeled during reverse transcription with the red-fluorescent dye Cy5.
The Cy5/Cy3 fluorescence ratio (gene expression) is obtained from the microarray by measuring the spot intensities with a fluorescence scanner.
Many important unanswered questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting microarray data.
A gene expression database can be regarded as consisting of three parts – the gene expression data matrix, gene annotation and sample annotation.
Figure: Gene expression array
gene      Cell Cycle 1  Cell Cycle 2  Sporulation 1  Sporulation 2  Shock 1  Shock 2  Diauxic Shift 1
YDR029W   0.05          -0.21         0.16           0.19           0        2.19     -2.0
YBL052C   -0.38         -0.71         0.14           0.33           0.01     0.09     1.7
YOR337W   -0.42         1.78          0              0.62           0.64     0.45     2.92
YMR183C   -0.86         -0.67         -3.23          0.09           -0.09    0.48     0.49
YKR021W   -0.85         1.31          -0.46          3.42           -2.38    0.19     0.64
YHR023W   -1.77         -0.86         -1.07          0.1            0.28     0.91     0.97
YHR029C   -0.58         -0.91         -0.62          -0.13          0.06     -0.08    2.03
Microarray gene expression values
Dynamic range of PMT: -1.2 to 1.2, -3.0 to 3.0, -1.5 to 1.5, -2.0 to 2.0
Average linkage hierarchical clustering is one of the first clustering algorithms applied to microarray data. Using a distance metric, the method builds a hierarchical binary tree (called a dendrogram). Given a set of N data points to be clustered, and an N × N distance (or similarity) matrix, the basic steps of hierarchical clustering are :
S1) Start by assigning each item to a cluster, so that if there are N items there are N clusters, each containing just one item. So, the distances (similarities) between the clusters are the same as the distances (similarities) between the items they contain.
S2) Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is one less cluster.
S3) Compute distances (similarities) between the new cluster and each of the old clusters.
S4) Repeat S2 and S3 until all items are clustered into a single cluster of size N.
Hierarchical Clustering
1. Single Linkage
2. Average Linkage
3. Complete Linkage
In single-linkage clustering (also called the connectedness or minimum method), the shortest distance from any member of one cluster to any member of the other cluster is considered as the distance between one cluster and another cluster.
In complete-linkage clustering (also called the diameter or maximum method), the distance between one cluster and another cluster is the largest distance from any member of one cluster to any member of the other cluster.
In average-linkage clustering, the distance between one cluster and another cluster is the average distance from any member of one cluster to any member of the other cluster.
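Steps S1 to S4 and the three linkage rules can be sketched together in plain Python. This is a deliberately naive illustration (it recomputes cluster distances at every step instead of updating them), with hypothetical function names and a toy distance matrix built from four points on a line.

```python
def linkage_distance(c1, c2, dist, method):
    """Distance between two clusters under the chosen linkage rule."""
    d = [dist[i][j] for i in c1 for j in c2]
    if method == "single":
        return min(d)
    if method == "complete":
        return max(d)
    if method == "average":
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage method: {method}")

def hierarchical_clustering(dist, method="average"):
    """Agglomerative clustering on an N x N distance matrix.

    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(dist))]  # S1: one item per cluster
    merges = []
    while len(clusters) > 1:                    # S4: repeat until one cluster
        best = None
        for a in range(len(clusters)):          # S2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], dist, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # S3: distances to the new cluster are recomputed on the next pass.
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges

# Toy example: four points on a line at positions 0, 1, 5, 7.
points = [0, 1, 5, 7]
dist = [[abs(p - q) for q in points] for p in points]
for method in ("single", "complete", "average"):
    print(method, hierarchical_clustering(dist, method)[-1][2])
```

The three rules agree on the first two merges here and differ only in the distance assigned to the final merge, which is what distinguishes the resulting dendrogram heights.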
Gene Ordering in Clustering Solutions
Subclusters found by Gene Ordering
3 Cell Cycle and DNA Processing
4 Transcription
6 Protein Fate (folding, modification, destination)
7 Protein with Binding Function or Cofactor Requirement
9 Cellular Transport, Transport Facilitation and Transport Routes
Gene ordering helps to identify the number and size of subclusters
Relations among genes within a cluster can be identified with gene ordering
Figure 2: Comparing Average Linkage (Fig. a), Average Linkage + Leaf Ordering (Fig. b), K-means (Fig. c), and K-means + Gene Ordering (Fig. d) for Herpes data (106 x 21)