09 / 23 / 2005 eisner@cs.ualberta.ca 1
Predicting Protein Function Using Machine-Learned Hierarchical Classifiers
Roman Eisner
Supervisors: Duane Szafron and Paul Lu
Outline
Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion
Proteins
Functional Units in the cell
Perform a Variety of Functions
  e.g. Catalysis of reactions, Structural and mechanical roles, transport of other molecules
Can take years to study a single protein
  Any good leads would be helpful!
Protein Function Prediction and Protein Function Determination
Prediction: An estimate of what function a protein performs
Determination: Work in a laboratory to observe and discover what function a protein performs
Prediction complements determination
Proteins
Chain of amino acids
20 Amino Acids
FASTA Format:
>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
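A minimal sketch of reading a record in this format, assuming a header line that starts with `>` followed by one or more sequence lines. The function name `parse_fasta` is illustrative and not part of the original work.

```python
from typing import Dict, Iterable, Optional

# One-letter codes of the 20 standard amino acids.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


example = [
    ">P18077 – R35A_HUMAN",
    "MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR",
    "CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL",
    "PAKAIGHRIRVMLYPSRI",
]

records = parse_fasta(example)
sequence = next(iter(records.values()))
print(len(sequence), set(sequence) <= VALID_AMINO_ACIDS)
```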
Ontologies
Standardized Vocabularies (Common Language)
In biological literature, different terms can be used to describe the same function
  e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity”
Can be structured in a hierarchy to show relationships
Gene Ontology
Directed Acyclic Graph (DAG)
Always changing
Describes 3 aspects of protein annotations:
  Molecular Function
  Biological Process
  Cellular Component
Hierarchical Ontologies
Can help to represent a large number of classes
Represent General and Specific data
Some data is incomplete – could become more specific in the future
Goal
To predict the function of proteins given their sequence
Data Set
Protein Sequences: UniProt database
Ontology: Gene Ontology, Molecular Function aspect
Experimental Annotations: Gene Ontology Annotation project @ EBI
Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
Final Data Set: 14,362 proteins
Predictors
Global:
  BLAST NN
Local:
  PA-SVM
  PFAM-SVM
  Probabilistic Suffix Trees
Why Linear SVMs?
Accurate
Explainability
  Each term in the dot product is meaningful
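To illustrate the explainability point: a linear classifier scores an input as a bias plus a sum of weight-times-feature terms, so each term can be reported as that feature's contribution. The sketch below uses hypothetical feature names and weights; it is not the PA-SVM or PFAM-SVM feature set.

```python
from typing import Dict, List, Tuple


def explain_linear_score(
    weights: Dict[str, float],
    features: Dict[str, float],
    bias: float,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the decision value and the per-feature terms of the dot product."""
    contributions = [
        (name, weights.get(name, 0.0) * value) for name, value in features.items()
    ]
    # Largest absolute contribution first, so the most influential terms lead.
    contributions.sort(key=lambda item: abs(item[1]), reverse=True)
    score = bias + sum(term for _, term in contributions)
    return score, contributions


# Hypothetical weights and feature values, for illustration only.
weights = {"feature_a": 1.5, "feature_b": -0.5, "feature_c": 0.25}
features = {"feature_a": 1.0, "feature_b": 2.0, "feature_c": 0.0}

score, contributions = explain_linear_score(weights, features, bias=-0.25)
for name, term in contributions:
    print(f"{name}: {term:+.2f}")
print(f"decision value: {score:+.2f}")
```

A positive decision value would correspond to a positive prediction for the node; the sorted terms show which features pushed the score in each direction.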
PST
Probabilistic Suffix Trees
Efficient Markov chains
Model the protein sequences directly:
Prediction:
BLAST
Protein Sequence Alignment for a query protein against any set of protein sequences
Evaluating Predictions in a Hierarchy
Not all errors are equivalent
  An error to a sibling is different from an error to an unrelated part of the hierarchy
Proteins can perform more than one function
  Need to combine predictions of multiple functions into a single measure
Evaluating Predictions in a Hierarchy
Semantics of the hierarchy – True Path Rule
Protein labeled with: {T} -> {T, A1, A2}
Predicted functions: {S} -> {S, A1, A2}
Precision = 2/3 = 67%
Recall = 2/3 = 67%
Evaluating Predictions in a Hierarchy
Protein labeled with: {T} -> {T, A1, A2}
Predicted: {C1} -> {C1, T, A1, A2}
Precision = 3/4 = 75%
Recall = 3/3 = 100%
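The two worked examples can be reproduced with a short sketch: expand both label sets with the true path rule, then compare them. The hierarchy in `PARENTS` is an assumed reconstruction (A2 above A1, T and S as children of A1, C1 as a child of T), since the slide figure is not part of the text, and the function names are illustrative.

```python
from typing import Dict, Set, Tuple

# Assumed hierarchy: node -> set of parents.
PARENTS: Dict[str, Set[str]] = {
    "A2": set(),
    "A1": {"A2"},
    "T": {"A1"},
    "S": {"A1"},
    "C1": {"T"},
}


def propagate(labels: Set[str]) -> Set[str]:
    """True path rule: a label implies every one of its ancestors."""
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(PARENTS[node])
    return closed


def hierarchical_precision_recall(
    actual: Set[str], predicted: Set[str]
) -> Tuple[float, float]:
    """Precision and recall over the ancestor-expanded label sets (both non-empty)."""
    actual_closed = propagate(actual)
    predicted_closed = propagate(predicted)
    overlap = len(actual_closed & predicted_closed)
    return overlap / len(predicted_closed), overlap / len(actual_closed)


sibling_error = hierarchical_precision_recall({"T"}, {"S"})
child_error = hierarchical_precision_recall({"T"}, {"C1"})
print(sibling_error, child_error)
```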
Cross-Validation
Used to estimate performance of classification system on future data
5 Fold Cross-Validation:
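A minimal sketch of a 5-fold split, assuming the proteins are shuffled once and dealt into five folds, each fold serving as the test set in turn. The identifiers and the helper `k_fold_splits` are hypothetical; the slides do not say how the folds were drawn.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(
    items: Sequence[str], k: int = 5, seed: int = 0
) -> List[Tuple[List[str], List[str]]]:
    """Return k (train, test) pairs; every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        splits.append((train, test))
    return splits


proteins = [f"protein_{i}" for i in range(23)]  # hypothetical identifiers
splits = k_fold_splits(proteins)
for train, test in splits:
    print(len(train), len(test))
```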
Inclusive vs Exclusive Local Predictors
In a system of local predictors, how should each local predictor behave?
Two extremes:
  Exclusive: a local predictor predicts positive only for those proteins that belong exactly at that node
  Inclusive: a local predictor predicts positive for those proteins that belong at or below it in the hierarchy
No a priori reason to choose either
Training Set Design
Proteins in the current fold’s training set can be used in any way
Need to select for each local predictor:
  Positive training examples
  Negative training examples
Training Set Design
Method          Positive Examples   Negative Examples
Exclusive       T                   Not [T]
Less Exclusive  T                   Not [T U Descendants(T)]
Less Inclusive  T U Descendants(T)  Not [T U Descendants(T)]
Inclusive       T U Descendants(T)  Not [T U Descendants(T) U Ancestors(T)]
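The four schemes can be sketched as a choice of two node sets for a node T: the nodes whose proteins count as positives, and the nodes whose proteins are kept out of the negatives. This is an illustration under assumptions: `annotations` maps each protein to its most specific labels, and the tiny hierarchy (A above T above C, with U unrelated) is hypothetical.

```python
from typing import Dict, Set, Tuple


def select_training_sets(
    scheme: str,
    node: str,
    annotations: Dict[str, Set[str]],
    descendants: Dict[str, Set[str]],
    ancestors: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (positive proteins, negative proteins) for the local predictor at node."""
    at_node = {node}
    at_or_below = {node} | descendants[node]
    related = at_or_below | ancestors[node]
    positive_nodes, excluded_nodes = {
        "exclusive": (at_node, at_node),
        "less_exclusive": (at_node, at_or_below),
        "less_inclusive": (at_or_below, at_or_below),
        "inclusive": (at_or_below, related),
    }[scheme]
    positives = {p for p, labels in annotations.items() if labels & positive_nodes}
    negatives = {p for p, labels in annotations.items() if not (labels & excluded_nodes)}
    return positives, negatives


# Hypothetical data: one protein labeled at each node.
annotations = {"p_a": {"A"}, "p_t": {"T"}, "p_c": {"C"}, "p_u": {"U"}}
descendants = {"T": {"C"}}
ancestors = {"T": {"A"}}

results = {
    scheme: select_training_sets(scheme, "T", annotations, descendants, ancestors)
    for scheme in ("exclusive", "less_exclusive", "less_inclusive", "inclusive")
}
for scheme, (pos, neg) in results.items():
    print(scheme, sorted(pos), sorted(neg))
```

Under the Less Exclusive and Inclusive schemes some proteins are neither positive nor negative for T, which is how those schemes leave ambiguous examples out of training.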
Comparing Training Set Design Schemes Using PA-SVM
Method          Precision  Recall  F1-Measure  Exceptions per Protein
Exclusive       75.8%      32.8%   45.8%       1.52
Less Exclusive  77.7%      40.4%   53.1%       1.74
Less Inclusive  77.3%      63.8%   69.9%       0.05
Inclusive       75.3%      65.2%   69.9%       0.09
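The F1-Measure column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages in the table, so small differences in the last digit are expected; the helper name `f1_measure` is illustrative.

```python
from typing import Dict, Tuple


def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the table above.
rows: Dict[str, Tuple[float, float, float]] = {
    "Exclusive": (75.8, 32.8, 45.8),
    "Less Exclusive": (77.7, 40.4, 53.1),
    "Less Inclusive": (77.3, 63.8, 69.9),
    "Inclusive": (75.3, 65.2, 69.9),
}

computed = {name: f1_measure(p, r) for name, (p, r, _) in rows.items()}
for name, value in computed.items():
    print(f"{name}: {value:.1f}% (reported {rows[name][2]}%)")
```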
Lowering the Cost of Local Predictors
Top-Down
  Compute local predictors top to bottom until a negative prediction is reached
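A minimal sketch of the top-down idea: evaluate a node's local predictor, and only descend to its children when the prediction is positive. The hierarchy, the stand-in `predict` function, and the rule that a node is visited if any parent is positive are assumptions for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def top_down_predict(
    roots: List[str],
    children: Dict[str, List[str]],
    predict: Callable[[str], bool],
) -> Tuple[Set[str], int]:
    """Return (nodes predicted positive, number of local predictors computed)."""
    positives: Set[str] = set()
    evaluated: Set[str] = set()
    frontier = list(roots)
    while frontier:
        node = frontier.pop()
        if node in evaluated:
            continue  # In a DAG a node can be reached through several parents.
        evaluated.add(node)
        if predict(node):
            positives.add(node)
            frontier.extend(children.get(node, []))
        # A negative prediction prunes the whole subtree below this node.
    return positives, len(evaluated)


# Hypothetical hierarchy of 6 nodes and a stand-in for the local predictors.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
truly_positive = {"root", "A", "A1"}

positives, computed = top_down_predict(
    ["root"], children, lambda node: node in truly_positive
)
print(sorted(positives), computed)
```

Here B is predicted negative, so the predictor at B1 is never computed: 5 of the 6 local predictors run.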
Top-Down Search
Method          Previous F1-Measure  Top-Down F1-Measure  Number of Local Predictors Computed
Exclusive       45.8%                0.4%                 10
Less Exclusive  53.1%                2.7%                 10
Less Inclusive  69.9%                69.8%                32
Inclusive       69.9%                69.9%                32
Predictor Results
Predictor Precision Recall
PA-SVM 75.4% 64.8%
PFAM-SVM 74.0% 57.5%
PST 57.5% 63.6%
BLAST 76.7% 69.6%
Voting 76.3% 73.3%
Similar and Dissimilar Proteins
89% of proteins – at least one good BLAST hit
  Proteins which are similar (often homologous) to the set of well studied proteins
11% of proteins – no good BLAST hit
  Proteins which are not similar to the set of well studied proteins
Coverage
Coverage: Percentage of proteins for which a prediction is made
Organism         Good BLAST Hit  No Good BLAST Hit
D. melanogaster  60%             40%
S. cerevisiae    62%             38%
Similar Proteins – Exploiting BLAST
BLAST is fast and accurate when a good hit is found
Can exploit this to lower the cost of local predictors
  Generate candidate nodes
  Only compute local predictors for candidate nodes
Candidate node set should have:
  High Recall
  Minimal Size
Similar Proteins – Exploiting BLAST
Candidate node generating methods:
  Searching outward from BLAST hit
  Performing the union of more than one BLAST hit’s annotations
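A hedged sketch of the union method: take the annotations of the top k BLAST hits and pool them into one candidate set. The input is a pre-ranked list of hit identifiers (no BLAST call is made here), and adding each annotation's ancestors to the candidates is an assumption, as are all names and data.

```python
from typing import Dict, List, Set


def candidate_nodes_from_hits(
    ranked_hits: List[str],
    annotations: Dict[str, Set[str]],
    ancestors: Dict[str, Set[str]],
    k: int = 2,
) -> Set[str]:
    """Union the annotations (plus their ancestors) of the k best-ranked hits."""
    candidates: Set[str] = set()
    for hit in ranked_hits[:k]:
        for node in annotations.get(hit, set()):
            candidates.add(node)
            candidates |= ancestors.get(node, set())
    return candidates


# Hypothetical hits, ordered from best to worst, with hypothetical annotations.
ranked_hits = ["hit_1", "hit_2", "hit_3"]
annotations = {"hit_1": {"T"}, "hit_2": {"S"}, "hit_3": {"X"}}
ancestors = {"T": {"A1", "A2"}, "S": {"A1", "A2"}, "X": {"A2"}}

candidates = candidate_nodes_from_hits(ranked_hits, annotations, ancestors, k=2)
print(sorted(candidates))
```

Only the local predictors at the candidate nodes would then be computed, which is where the cost saving comes from.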
Similar Proteins – Exploiting BLAST
Method          Precision  Recall  Avg Cost per Protein
All             77%        80%     1219
Top-Down        77%        79%     111
BLAST-2-Union   79%        78%     20
BLAST-Search-3  78%        78%     221
Dissimilar Proteins
Method           Precision  Recall  Avg Cost per Protein
BLAST            19%        20%     1
Voting           55%        32%     812
Top-Down Voting  56%        32%     58
The more interesting case
Comparison to Protfun
On a pruned ontology (9 Gene Ontology classes)
On 1,637 “no good BLAST hit” proteins
Method Precision Recall
Protfun 14% 13%
Voting 69% 29%
Future Work
Try other two ontologies – biological process and cellular component
Use other local predictors
More parameter tuning
Predictor cost
Conclusion
Protein Function Prediction provides good leads for Protein Function Determination
Hierarchical ontologies can represent incomplete data allowing the prediction of more functions
Considering the hierarchy: More accurate & Less Computationally Intensive
Methods presented have a higher coverage than BLAST alone
Results accepted to IEEE CIBCB 2005
Thanks to…
Duane Szafron and Paul Lu
Brett Poulin and Russ Greiner
Everyone in the Proteome Analyst research group
Incomplete Data & Prediction
Inclusive avoids using ambiguous (incomplete) training data
Does this help?
To test:
  Train on more Incomplete Data: Choose X% of proteins, and move one annotation up
  Evaluate predictions on “Complete” data
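One way to read the "move one annotation up" step, sketched under assumptions: pick a fraction of the proteins at random and replace one of each protein's labels with that label's parent. A single parent per node, the helper name `make_incomplete`, and the sample data are all hypothetical.

```python
import random
from typing import Dict, Set


def make_incomplete(
    annotations: Dict[str, Set[str]],
    parent: Dict[str, str],
    fraction: float,
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Return a copy in which a fraction of proteins has one label moved to its parent."""
    rng = random.Random(seed)
    proteins = sorted(annotations)
    chosen = rng.sample(proteins, round(fraction * len(proteins)))
    result = {protein: set(labels) for protein, labels in annotations.items()}
    for protein in chosen:
        movable = sorted(node for node in result[protein] if node in parent)
        if not movable:
            continue  # Labels at the root cannot move any higher.
        node = rng.choice(movable)
        result[protein].remove(node)
        result[protein].add(parent[node])
    return result


# Hypothetical data: four proteins labeled at C1, whose parent is T.
annotations = {f"p{i}": {"C1"} for i in range(4)}
parent = {"C1": "T", "T": "A1"}

noisy = make_incomplete(annotations, parent, fraction=0.5)
print(noisy)
```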
Local vs Global Cross-Validation
Some node predictors have as few as 20 positive examples
How to do cross-validation to make sure each predictor has enough positive training examples?