09 / 23 / 2005 eisner@cs.ualberta.ca 1

Predicting Protein Function Using Machine-Learned Hierarchical Classifiers

Roman Eisner

Supervisors: Duane Szafron and Paul Lu

Outline

Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion

Proteins

Functional units in the cell
Perform a variety of functions, e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules
Can take years to study a single protein
Any good leads would be helpful!

Protein Function Prediction and Protein Function Determination

Prediction: an estimate of what function a protein performs
Determination: work in a laboratory to observe and discover what function a protein performs

Prediction complements determination

Proteins

Chain of amino acids
20 amino acids

FASTA format:

>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
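As a small illustration of the format above, the following sketch reads a FASTA record into a header and a single sequence string. The function name `parse_fasta` is hypothetical and the parser handles only the basic layout shown on the slide (a `>` header line followed by sequence lines); it is not a full FASTA implementation.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between sequence lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


example = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
"""

records = parse_fasta(example)
sequence = next(iter(records.values()))
```

The resulting `sequence` is one uninterrupted string over the 20-letter amino acid alphabet, which is the form a sequence-based predictor would take as input.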

Ontologies

Standardized Vocabularies (Common Language)

In biological literature, different terms can be used to describe the same function, e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity”
Can be structured in a hierarchy to show relationships

Gene Ontology

Directed Acyclic Graph (DAG)
Always changing
Describes 3 aspects of protein annotations: Molecular Function, Biological Process, Cellular Component

Hierarchical Ontologies

Can help to represent a large number of classes

Represent general and specific data
Some data is incomplete – could become more specific in the future

Incomplete Annotations

Goal

To predict the function of proteins given their sequence

Data Set

Protein sequences: UniProt database

Ontology: Gene Ontology, Molecular Function aspect

Experimental annotations: Gene Ontology Annotation project @ EBI

Pruned Ontology: 406 nodes (out of 7,399) with ≥ 20 proteins

Final Data Set: 14,362 proteins
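The pruning step can be sketched as a count-and-threshold pass over the ontology. This is only an illustration: the names `ancestors` and `prune` are hypothetical, and it assumes that a protein annotated at a node also counts toward every ancestor of that node, which the slide does not state explicitly.

```python
from typing import Dict, Set


def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All nodes reachable from `node` by following parent edges."""
    seen: Set[str] = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def prune(parents: Dict[str, Set[str]],
          annotations: Dict[str, Set[str]],
          min_proteins: int = 20) -> Set[str]:
    """Keep only nodes that have at least `min_proteins` proteins."""
    counts: Dict[str, int] = {}
    for labels in annotations.values():
        covered = set(labels)
        for node in labels:
            covered |= ancestors(node, parents)  # assumed propagation
        for node in covered:
            counts[node] = counts.get(node, 0) + 1
    return {node for node, count in counts.items() if count >= min_proteins}


# Toy ontology and annotations (made up for illustration).
toy_parents = {"root": set(), "A": {"root"}, "B": {"root"}, "A1": {"A"}}
toy_annotations = {"p1": {"A1"}, "p2": {"A1"}, "p3": {"B"}}
kept = prune(toy_parents, toy_annotations, min_proteins=2)
```

In the toy example the threshold is lowered to 2 so that node `B`, with a single protein, is the only node dropped.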

Predictors

Global:
BLAST NN

Local:
PA-SVM (linear)
PFAM-SVM (linear)
Probabilistic Suffix Trees

Why Linear SVMs?

Accurate
Explainability: each term in the dot product is meaningful
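To make the explainability point concrete, the sketch below breaks a linear decision score into one contribution per feature. The feature names, weights, and the helper `explain_linear_score` are invented for illustration and are not taken from PA-SVM or PFAM-SVM.

```python
from typing import Dict, List, Tuple


def explain_linear_score(weights: Dict[str, float],
                         features: Dict[str, float],
                         bias: float = 0.0) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the linear score and the per-feature terms w_i * x_i."""
    contributions = {
        name: weights.get(name, 0.0) * value
        for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    # Largest absolute contribution first: these terms "explain" the decision.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


# Hypothetical weights learned for one node, and one protein's features.
weights = {"kinase": 1.5, "membrane": -0.5, "binding": 0.25}
features = {"kinase": 1.0, "membrane": 1.0}
score, ranked = explain_linear_score(weights, features, bias=-0.5)
```

Because the score is a plain sum, each entry of `ranked` can be read directly as evidence for or against the prediction, which is not possible with a non-linear kernel.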

PA-SVM

Proteome Analyst

PFAM-SVM

Hidden Markov Models

PST

Probabilistic Suffix Trees
Efficient Markov chains

Model the protein sequences directly:

Prediction:
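The idea of modelling sequences directly with a Markov chain can be sketched as follows. Note the substitution: a probabilistic suffix tree conditions on variable-length contexts, whereas this sketch uses a much simpler fixed first-order chain with add-one smoothing. The function names and the toy sequences are assumptions for illustration only.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_markov(sequences: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Count first-order transitions (previous residue -> next residue)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {prev: dict(row) for prev, row in counts.items()}


def log_likelihood(seq: str, counts: Dict[str, Dict[str, int]]) -> float:
    """Log-probability of the transitions in `seq`, with add-one smoothing."""
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        row = counts.get(prev, {})
        numerator = row.get(cur, 0) + 1
        denominator = sum(row.values()) + len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


# Toy training sets for one node: members vs. non-members (made up).
positive_model = train_markov(["AAAAAA", "AAAKAA"])
negative_model = train_markov(["KKKKKK", "KKAKKK"])

query = "AAAA"
predicted_positive = (
    log_likelihood(query, positive_model) > log_likelihood(query, negative_model)
)
```

Prediction here compares the likelihood of the query under a model of member sequences with its likelihood under a model of non-member sequences; how the actual PST predictor sets its decision rule is not recoverable from the slide.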

BLAST

Protein Sequence Alignment for a query protein against any set of protein sequences

Evaluating Predictions in a Hierarchy

Not all errors are equivalent
An error to a sibling is different from an error to an unrelated part of the hierarchy
Proteins can perform more than one function
Need to combine predictions of multiple functions into a single measure

09 / 23 / 2005 eisner@cs.ualberta.ca 25

Evaluating Predictions in a Hierarchy

Semantics of the hierarchy – True Path Rule
Protein labeled with: {T} -> {T, A1, A2}
Predicted functions: {S} -> {S, A1, A2}
Precision = 2/3 = 67%
Recall = 2/3 = 67%

09 / 23 / 2005 eisner@cs.ualberta.ca 26

Evaluating Predictions in a Hierarchy

Protein labeled with: {T} -> {T, A1, A2}
Predicted: {C1} -> {C1, T, A1, A2}
Precision = 3/4 = 75%
Recall = 3/3 = 100%
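The two worked examples can be reproduced with a short sketch. Under the True Path Rule, a protein annotated with a node is also annotated with all of that node's ancestors, so both the true and the predicted label sets are expanded before counting overlap. The exact shape of the toy hierarchy (A2 above A1, A1 above both T and S, T above C1) is an assumption consistent with the label sets shown; the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def with_ancestors(nodes: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Expand a label set with all ancestors (True Path Rule)."""
    closed: Set[str] = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents.get(node, ()))
    return closed


def hierarchical_precision_recall(true_nodes: Iterable[str],
                                  predicted_nodes: Iterable[str],
                                  parents: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Precision and recall computed on ancestor-expanded label sets."""
    true_set = with_ancestors(true_nodes, parents)
    pred_set = with_ancestors(predicted_nodes, parents)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


# Assumed toy hierarchy matching the label sets in the examples.
toy_parents = {"A2": set(), "A1": {"A2"}, "T": {"A1"}, "S": {"A1"}, "C1": {"T"}}

sibling_case = hierarchical_precision_recall({"T"}, {"S"}, toy_parents)
child_case = hierarchical_precision_recall({"T"}, {"C1"}, toy_parents)
```

A sibling error still earns partial credit for the shared ancestors, while predicting a child of the true node costs precision but not recall.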

Supervised Learning

Cross-Validation

Used to estimate performance of classification system on future data

5 Fold Cross-Validation:
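A minimal sketch of a 5-fold split, assuming proteins are identified by simple IDs and assigned to folds at random. The generator name `k_fold_splits` and the seed are hypothetical choices, not details from the talk.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def k_fold_splits(items: Iterable[str], k: int = 5,
                  seed: int = 0) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training set, test set) pairs, one pair per fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


protein_ids = [f"protein_{n}" for n in range(10)]  # made-up IDs
splits = list(k_fold_splits(protein_ids, k=5))
```

Each protein appears in exactly one test fold, and the predictor evaluated on a fold is trained only on the other four.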

Inclusive vs Exclusive Local Predictors

In a system of local predictors, how should each local predictor behave?
Two extremes:
Exclusive: a local predictor predicts positive only for those proteins that belong exactly at that node
Inclusive: a local predictor predicts positive for those proteins that belong at or below that node in the hierarchy
No a priori reason to choose either

Exclusive Local Predictors

Inclusive Local Predictors

Training Set Design

Proteins in the current fold’s training set can be used in any way
Need to select for each local predictor:
Positive training examples
Negative training examples

Training Set Design

Method          Positive Examples    Negative Examples
Exclusive       T                    Not [T]
Less Exclusive  T                    Not [T U Descendants(T)]
Less Inclusive  T U Descendants(T)   Not [T U Descendants(T)]
Inclusive       T U Descendants(T)   Not [T U Descendants(T) U Ancestors(T)]
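The four schemes in the table can be written down as set operations. The sketch below is one reading of the table, not the authors' code: it assumes `annotations` maps each protein to its most specific (unpropagated) labels, that "T" means proteins annotated directly at node T, and that `Descendants(T)` and `Ancestors(T)` mean proteins annotated at those nodes. All function names are hypothetical.

```python
from typing import Dict, Set, Tuple


def reachable(start: str, edges: Dict[str, Set[str]]) -> Set[str]:
    """Nodes reachable from `start` via `edges`, excluding `start` itself."""
    seen: Set[str] = set()
    stack = list(edges.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def training_sets(node: str, scheme: str,
                  annotations: Dict[str, Set[str]],
                  parents: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (positive proteins, negative proteins) for one local predictor."""
    children: Dict[str, Set[str]] = {}
    for child, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, set()).add(child)
    desc = reachable(node, children)
    anc = reachable(node, parents)

    def proteins_at(nodes: Set[str]) -> Set[str]:
        return {p for p, labels in annotations.items() if labels & nodes}

    at_t = proteins_at({node})
    at_or_below = proteins_at({node} | desc)
    at_below_or_above = proteins_at({node} | desc | anc)
    everyone = set(annotations)

    table = {
        "exclusive": (at_t, everyone - at_t),
        "less_exclusive": (at_t, everyone - at_or_below),
        "less_inclusive": (at_or_below, everyone - at_or_below),
        "inclusive": (at_or_below, everyone - at_below_or_above),
    }
    return table[scheme]


# Toy hierarchy: A is the parent of T and S; C is a child of T.
toy_parents = {"A": set(), "T": {"A"}, "C": {"T"}, "S": {"A"}}
toy_annotations = {"p_a": {"A"}, "p_t": {"T"}, "p_c": {"C"}, "p_s": {"S"}}
inclusive_pos, inclusive_neg = training_sets("T", "inclusive", toy_annotations, toy_parents)
```

In the toy example the inclusive scheme leaves `p_a` out of both sets: a protein annotated only at an ancestor might really belong at T, so it is treated as ambiguous rather than negative.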

Comparing Training Set Design Schemes Using PA-SVM

Method          Precision  Recall  F1-Measure  Exceptions per Protein
Exclusive       75.8%      32.8%   45.8%       1.52
Less Exclusive  77.7%      40.4%   53.1%       1.74
Less Inclusive  77.3%      63.8%   69.9%       0.05
Inclusive       75.3%      65.2%   69.9%       0.09
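The F1-Measure column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded percentages in the table; because the inputs are already rounded, a recomputed value can differ from the reported one in the last digit.

```python
from typing import Dict, Tuple


def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as reported in the table.
rows: Dict[str, Tuple[float, float]] = {
    "Exclusive": (75.8, 32.8),
    "Less Exclusive": (77.7, 40.4),
    "Less Inclusive": (77.3, 63.8),
    "Inclusive": (75.3, 65.2),
}
f1_scores = {method: f1_measure(p, r) for method, (p, r) in rows.items()}
```

The harmonic mean is dominated by the smaller of the two values, which is why the exclusive schemes score low despite precision comparable to the inclusive ones.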

Exclusive schemes have more exceptions

Lowering the Cost of Local Predictors

Top-Down: compute local predictors top to bottom until a negative prediction is reached
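A top-down search of this kind can be sketched as a breadth-first traversal that stops descending at any node whose local predictor says no. Two details are assumptions here, since the slide does not specify them: the root is treated as having its own predictor, and in a DAG a node is evaluated as soon as any one of its parents is positive. The name `top_down_predict` and the toy predictor are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple


def top_down_predict(root: str,
                     children: Dict[str, List[str]],
                     predict: Callable[[str], bool]) -> Tuple[Set[str], int]:
    """Return (positive nodes, number of local predictors computed)."""
    positives: Set[str] = set()
    evaluated: Set[str] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in evaluated:
            continue  # a DAG node can be reached through several parents
        evaluated.add(node)
        if predict(node):
            positives.add(node)
            queue.extend(children.get(node, ()))  # descend only below positives
    return positives, len(evaluated)


# Toy hierarchy and a stand-in for the local predictors.
toy_children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
toy_positive_nodes = {"root", "A", "A1"}
positives, cost = top_down_predict(
    "root", toy_children, lambda node: node in toy_positive_nodes
)
```

In the toy run the predictor for `B1` is never computed because its parent `B` was negative, which is where the cost saving comes from. It also shows why the approach depends on predictors near the top being reliable: a false negative high in the hierarchy hides everything below it.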

Top-Down Search

Method          Previous F1-Measure  Top-Down F1-Measure  Number of Local Predictors Computed
Exclusive       45.8%                0.4%                 10
Less Exclusive  53.1%                2.7%                 10
Less Inclusive  69.9%                69.8%                32
Inclusive       69.9%                69.9%                32

Predictor Results

Predictor  Precision  Recall
PA-SVM     75.4%      64.8%
PFAM-SVM   74.0%      57.5%
PST        57.5%      63.6%
BLAST      76.7%      69.6%
Voting     76.3%      73.3%

Similar and Dissimilar Proteins

89% of proteins – at least one good BLAST hit
Proteins which are similar (often homologous) to the set of well studied proteins

11% of proteins – no good BLAST hit
Proteins which are not similar to the set of well studied proteins

Coverage

Coverage: Percentage of proteins for which a prediction is made

Organism         Good BLAST Hit  No Good BLAST Hit
D. melanogaster  60%             40%
S. cerevisiae    62%             38%

Similar Proteins – Exploiting BLAST

BLAST is fast and accurate when a good hit is found
Can exploit this to lower the cost of local predictors
Generate candidate nodes
Only compute local predictors for candidate nodes
Candidate node set should have: high recall, minimal size

Similar Proteins – Exploiting BLAST

Candidate node generating methods:
Searching outward from the BLAST hit
Taking the union of more than one BLAST hit’s annotations
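Both candidate-generation ideas can be sketched as small set operations. The details are assumptions: `union_candidates` takes the annotations of the top `k` BLAST hits (best hit first), and `search_candidates` collects every node within `radius` edges of the starting annotations, moving along both parent and child edges. Neither function name comes from the talk, and how the actual methods bound the search is not stated.

```python
from typing import Dict, List, Set


def union_candidates(hit_annotations: List[Set[str]], k: int) -> Set[str]:
    """Union of the annotations of the top `k` BLAST hits."""
    candidates: Set[str] = set()
    for labels in hit_annotations[:k]:
        candidates |= labels
    return candidates


def search_candidates(start: Set[str], radius: int,
                      neighbours: Dict[str, Set[str]]) -> Set[str]:
    """All nodes within `radius` edges of any starting node."""
    seen = set(start)
    frontier = set(start)
    for _ in range(radius):
        frontier = {m for n in frontier for m in neighbours.get(n, ())} - seen
        seen |= frontier
    return seen


# Toy ontology as an undirected adjacency map: root - A - B - C - D.
toy_neighbours = {
    "root": {"A"}, "A": {"root", "B"}, "B": {"A", "C"},
    "C": {"B", "D"}, "D": {"C"},
}
toy_hits = [{"B"}, {"C"}, {"D"}]  # annotations of hits, best hit first

from_union = union_candidates(toy_hits, k=2)
from_search = search_candidates({"C"}, radius=1, neighbours=toy_neighbours)
```

Local predictors would then be computed only for the returned nodes, so a smaller candidate set lowers cost while a set that misses the true node caps the achievable recall.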

Similar Proteins – Exploiting BLAST

Method          Precision  Recall  Avg Cost per Protein
All             77%        80%     1219
Top-Down        77%        79%     111
BLAST-2-Union   79%        78%     20
BLAST-Search-3  78%        78%     221

Dissimilar Proteins

Method           Precision  Recall  Avg Cost per Protein
BLAST            19%        20%     1
Voting           55%        32%     812
Top-Down Voting  56%        32%     58

The more interesting case

Comparison to Protfun

On a pruned ontology (9 Gene Ontology classes)
On 1,637 “no good BLAST hit” proteins

Predictor  Precision  Recall
Protfun    14%        13%
Voting     69%        29%

Future Work

Try other two ontologies – biological process and cellular component

Use other local predictors
More parameter tuning
Predictor cost

Conclusion

Protein Function Prediction provides good leads for Protein Function Determination

Hierarchical ontologies can represent incomplete data allowing the prediction of more functions

Considering the hierarchy: More accurate & Less Computationally Intensive

Methods presented have a higher coverage than BLAST alone

Results accepted to IEEE CIBCB 2005

Thanks to…

Duane Szafron and Paul Lu

Brett Poulin and Russ Greiner

Everyone in the Proteome Analyst research group

Incomplete Data & Prediction

Inclusive avoids using ambiguous (incomplete) training data
Does this help?
To test:
Train on more incomplete data: choose X% of proteins, and move one annotation up
Evaluate predictions on “complete” data
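The test procedure can be sketched as a function that degrades a copy of the training annotations. Several details are assumptions: "move one annotation up" is read as replacing one annotation with one of its parents (a single level), the annotation and parent are picked at random, and proteins annotated only at the root are left unchanged. The name `make_incomplete` is hypothetical.

```python
import random
from typing import Dict, Set


def make_incomplete(annotations: Dict[str, Set[str]],
                    parents: Dict[str, Set[str]],
                    fraction: float,
                    seed: int = 0) -> Dict[str, Set[str]]:
    """Return a copy in which `fraction` of proteins have one label moved up."""
    rng = random.Random(seed)
    proteins = sorted(annotations)
    chosen = rng.sample(proteins, round(fraction * len(proteins)))
    result = {p: set(labels) for p, labels in annotations.items()}
    for protein in chosen:
        movable = sorted(n for n in result[protein] if parents.get(n))
        if not movable:
            continue  # only root-level labels: nothing to move up
        node = rng.choice(movable)
        parent = rng.choice(sorted(parents[node]))
        result[protein].discard(node)
        result[protein].add(parent)
    return result


# Toy chain root -> A -> A1 and three made-up proteins.
toy_parents = {"root": set(), "A": {"root"}, "A1": {"A"}}
toy_annotations = {"p1": {"A1"}, "p2": {"A"}, "p3": {"root"}}
degraded = make_incomplete(toy_annotations, toy_parents, fraction=1.0)
```

Training would use the degraded annotations while evaluation keeps the original ones, so any drop in performance can be attributed to the added incompleteness.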

Robustness to Incomplete Data

Local vs Global Cross-Validation

Some node predictors have as few as 20 positive examples

How to do cross-validation to make sure each predictor has enough positive training examples?

Local vs Global Cross-Validation

Local cross-validation is invalid
Predictions must be consistent
Need fold isolation
A single global split: global cross-validation