Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Tatyana Goldberg November 21, 2013
Subcellular Localization Development of a Prediction Tool
@
Eukaryotic Cell
2 K Rogers (2011) Britannica
Nucleus
Plasma membrane
Smooth Endoplasmic reticulum
Rough Endoplasmic reticulum
Mitochondrion
Cytoplasm
Peroxisome
Golgi apparatus
Protein Function: Examples
G Protein: Signaling &Transport
Across Cell Membranes
© http://www.pdb.org
Collagen: Structural Support
Crystalline: Capturing Light
ATP Synthase: Molecular Motor
Nucleosome: DNA Maintenance
Ribosome: Translation
3
Sequence to Function Gap
• UniProtKB: more than 48 Mio protein sequences (2013_11)
• UniProtKB/Swiss-Prot: 0.5 Mio manually annotated entries (2013_11)
Need for reliable automatic predictions of protein
function from amino acid sequence alone
UniProt Consortium (2011) Nucleic Acids Res A Bairoch and R Apweiler (1997) J. Mol. Med.
MTSHSYYKDRLGFDPNEQQPGSNNSMKRSSSRQTTHHHQSYHHATTSSSQSPARISVSPGGNNGTLEYQQVQRENNW…
Predictor Protein Function
4
Describing Protein Function
Gene Ontology (GO)
Molecular Function
Cellular Component
Biological Process
3461 terms
26116 terms
10543 terms
The Gene Ontology Consortium (2008) Nucleic Acids Res
GO: Cellular Component for HIST1H2AI 5
Stats: Nov 2013
Common compartment Common physiological function K Rogers (2011) Britannica
Cellular Compartmentalization
Prokaryotic Cell Eukaryotic Cell (bacillus type)
6
TM Bakheet and AJ Doig (2008) Bioinformatics
Subcellular Location in Drug Targets
Pero 1%
Microsome 2%
Other 2% Mito
2%
Nucleus 3%
ER 7%
Extra-cellular
13%
Cytoplasm 13%
Membrane 57%
Drug targets tend to be found in membranes, cytoplasm or are extra-cellular!
7
8 L Rajendran, K Simons et al. (2010) Nature Reviews
Mislocalizations Associated with Human Diseases
MC Hung and W Link (2012) JCS 9
Mislocalization of PKC
© http://www.jbc.org/content/279/6.cover-expansion D Chen, AC Newton, et al. (2004) JBC
=> Cytokinesis Failure
Control
10
Control
Microtubules Nuclei
Why Subcellular Localization?
• Knowledge of subcellular localization of a protein provides information about its functional role
• Improved target identification during the drug discovery process
• Aberrant protein subcellular location has been observed in the cells of several diseases
• Can help validate or analyze protein protein interactions
11
Localization Predictions Indispensable
High-throughput methods
• cost money
• not accurate
• not complete
Intracellular mitochondrial net-
work, microtubules, nuclei
© http://www.microscopyu.com
SWISS-PROT: Saccharomyces cerevisiae
• ~6,600 manually annotated entries
• ~4,500 have localization information (Nov 2013) 12
Localization is ideal for Predictions
• Localization is an easily identifiable functional feature
• Process of protein trafficking is quite well understood
• Localization data available in public protein databases
• Most proteins active in only one compartment 13
Trafficking in the Cell
Key:
gated transport
transmembrane transport
vesicular transport
B Alberts, et al. (1994) The Cell 14
Sorting Signals
B Alberts et al. (1994) The Cell
• A majority of signal peptides are still unknown
• For signal patches the situation is even worse
15
Protein Trafficking via Sorting Signals
1999 Nobel Prize in Physiology/ Medicine given to Günter Blobel
„for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell“
=> „address tag“ or „zip code“
Nucleus Intracellular Space Eukaryotic Cell PKKKRKV
http://www.time.com
16
17
Predicting Localization
1) Sorting signals (signalP, chloroP)
2) Homology-based inference (R Nair and B Rost, 2002)
3) Text-based analysis (LOCkey)
4) De novo (CELLO v.2.5)
5) Hybrid approaches (LOCtree, MultiLoc2, WoLF PSORT)
18
Problems to Consider
• Biological data is „noisy“ - at the biological, experimental and curation levels
• Prediction performance must be established properly • Sequence similarity between training and test data sets
• Sequence redundancy in public databases
• Prediction results must be interpretable by users • Benchmarking prediction methods can be a difficult task
19
New Method: LocTree2
Archaea
Bacteria
Eukaryota
20
Archaea
Bacteria
Eukaryota
21
New Method: LocTree2
• One framework for all domains of life
• Robustness to sequencing errors
• Predictions for the largest number of classes so far:
• 3 for Archaea • 6 for Bacteria • 18 for Eukaryota
• Membrane predictions
• Decision tree resembling cellular protein sorting
• Reliability score measuring the strength of a prediction
• High performance
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
22
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
23
Data Set for Development
- Exclusion of proteins with non-experimental annotations and of proteins with unclear or multiple localizations
- Identification of transmembrane proteins by „Single-pass“ or „Multi-pass“ keywords in the CC lines
24
Homology Reduction
Bias in protein databases towards certain protein families Construction of sequence unique sets in two steps
1. HSSP-value ≤ 0
2. BLAST E-value > 10-3
0 100 200 300 400 500
0
20
40
60
80
100
Length of alignment
% id
enti
cal r
esid
ues
Below this threshold no reliable annotation of subcellular localization based on sequence homology
C Sander and R Schneider (1991) Proteins; Rost B. (1999) Protein Eng B Rost (2002) J Mol Biol SF Altschul, DJ Lipman, et al. (1990) J Mol Biol
25
Data Statistics
0
5000
10000
15000
20000
25000
30000
35000
Archaea
Bacteria
Eukaryota
SWISS-PROT annotated
(2011_04 release)
# p
rote
ins
HSSP-value ≤ 0 BLAST E-value > 10-3
Homology Reduction 26
Stratified k-fold Cross-Validation
K-fold Cross Testing: a random partitioning into k equally sized subsets, use each subset for testing exactly once and the remaining k-1 subsets for training
Stratified: each subset contains about the same proportion of class labels as the original data set
Performance!
27
Size Increase of the Training Sets
Test Set Training Set SWISS-PROT annotated
Test Set Training Set
HSSP-value ≤0
HSSP-value ≤60
HSSP-value ≤0 HSSP-value ≤60
Size increase by almost a factor of 4!
28
Performance Estimates
29
Performance Estimate: Accuracy
• Number of correctly predicted proteins that are observed to be in localization L divided by the total number of proteins predicted in L
• How often the predicted localization class is correct (also called „precision“ or „specificity“)
30
𝐴𝑐𝑐 𝐿 = 100 𝑇𝑃
𝑇𝑃 + 𝐹𝑃
Performance Estimate: Coverage
• Number of correctly predicted proteins that are observed to be in localization L divided by the total number of proteins observed in L
• How often the observed localization class is predicted correctly (also called „recall“ or „sensitivity“)
31
𝐶𝑜𝑣 𝐿 = 100 𝑇𝑃
𝑇𝑃 + 𝐹𝑁
Measure for the Precision of Estimates
• How reliable are the estimates?
• What is the standard error?
B Efron (1979) The Annals of Statistics
Bootstrapping
• “To pull oneself up by one’s bootstraps” (The Adventures of Baron Munchausen)
• Introduced by Bradley Efron in 1979
• Popularized in 1980s due to the introduction of computers in statistical practice
32
Bootstrapping: the Algorithm
• Start with your set of predictions
• Randomly draw a subset of n predictions from the original set
• Compute an estimate x for this subset of predictions
• Repeat previous two steps m times
bootstrap estimates x1, …, xm
33
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
34
Linear SVM
- Linear separation by building a hyperplane
- Hyperplane with the maximum margin is the best
Support vector
Cortes C. and Vapnik V. Machine Learning (1995) 35
What if data is linearly non-separable?
Map data to a higher-dimensional feature space for a linear separation
Feature Map
Φ: x → φ(x)
Protein1
Protein2
Protein3
Protein5
Protein6
Protein7
Vector1
Vector8
Protein8
Protein9
Vector9
Vector2
Vector3
Vector4
Vector5
Vector6
A Kernel function performs this mapping and computes the distance between two vectors in the feature space.
36
String Kernel Selection
Idea: find similarity between two protein sequences by looking at the number of common k-mers
1. String Subsequence Kernel (H Lodhi, S Watkins, et al. 2002 JMLR)
• Text classification • 3 parameters
2. Mismatch Kernel (H Lodhi, S Watkins, et al. 2002 JMLR)
• Protein classification • 2 parameters
3. Profile Kernel (R Kuang, C Leslie, et al. 2004 Bioinformatics)
• Protein classification • 2 parameters and evolutionary profile
37
Profile Kernel: Example
Prot1 CATLGLLGLVAL Prot2 CARRGLLWAVAL
38 C Leslie, W Noble, et al. (2004) Bioinformatics
Profile Kernel: Example
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
C Leslie, W Noble, et al. (2004) Bioinformatics 39
Profile Kernel: Example
all possible 3-mers = 203
Φ(Prot1)=(0,…,0,…,0,…,0,…,0,…,0)
AAT CAT CTQ CAW
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
C Leslie, W Noble, et al. (2004) Bioinformatics 40
Profile Kernel: Example
Φ(Prot1)=(0,…,0,…,0,…,0,…,0,…,0)
AAT CAT CTQ CAW
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
CAT
AAT CAW
CAT CQT …
… …
(2+1+1<5)
(1+1+1<5) (1+1+1<5)
(1+1+3=5)
C Leslie, W Noble, et al. (2004) Bioinformatics 41
Profile Kernel: Example
Φ(Prot1)=(0,…,1,…,1,…,1,…,0,…,0)
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
CAT
AAT CAW
CAT CQT …
… …
(2+1+1<5)
(1+1+1<5) (1+1+1<5)
(1+1+3=5)
AAT CAT CTQ CAW
C Leslie, W Noble, et al. (2004) Bioinformatics 42
Profile Kernel: Example
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
Φ(Prot2)=(0,…,0,…,1,…,1,…,0,…,0)
K(Prot1, Prot2)= Φ(Prot1)· Φ(Prot2) = 32
CAT
AAT CAW
CAT CQT …
…
(2+1+1<5)
(1+1+1<5) (1+1+1<5)
(1+1+3=5)
…
Φ(Prot1)=(0,…,1,…,1,…,1,…,0,…,0)
AAT CAT CTQ CAW
C Leslie, W Noble, et al. (2004) Bioinformatics 43
Kernel Selection
SSK Mismatch Kernel Profile Kernel
Archaea 2 classes
97 ± 3 97 ± 3 97 ± 3
Bacteria 6 classes
88 ± 2 88 ± 2 93 ± 2
Eukaryota 10 classes 69 ± 1 Not available 81 ± 1
• For each kernel a range of parameter combinations tested
• Averaged overall accuracies over five training sets are reported
• One-against-all multiclass classifier
The Profile Kernel is a clear winner 44
MultiClass Classifier Selection
1. One – Against – All (V Vapnik 1995 The Nature of Statistical Learning Theory)
2. Ensemble on Nested Dichotomies (S Kramer and E Frank 2004 Proceedings of ICML)
3. Ensemble of Class Balanced Dichotomies (E Frank, S Dong, et al. 2005 PKDD)
4. Ensemble of Data Balanced Dichotomies (E Frank, S Dong, et al. 2005 PKDD)
5. Nested Dichotomy of a fixed structure
45
MultiClass Classifier #1: One - Against -All
Classification of data with labels from >2 classes
• Train n binary classifiers, one for each class against all other classes
• Predicted class is the class of the classifier with the highest value
#1
1 2,…,n
#2
2 1,…,n
#n
n 1,…,n-1
…
V Vapnik (1995) The Nature of Statistical Learning Theory 46
MultiClass Classifier #2: ENDs
Decompose classes into a system of nested dichotomies (NDs).
J Fox (1997) Applied regression analysis, linear models, and related methods. Sage. S Kramer and E Frank (2004) Proceedings of ICML
• Every ND is equally probable! • Take an Ensemble of Nested Dichotomies (ENDs) • The result of END is the average of estimates of individual NDs
… …
47
MultiClass Classifier #3, 4: ECBNDs, EDBNDs
ECBNDs: Ensemble of Class Balanced NDs • Each internal node of a binary tree is class balanced
EDBNDs: Ensemble of Data Balanced NDs • Each internal node of a binary tree is data balanced
Limited number of possible NDs
E Frank, S Dong, et al. (2005) PKDD 48
MultiClass Classifier #5: MyND
Nested Dichotomy of a fixed architecture
49
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
50
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
Plasma membrane
Cytosol
Plasma membrane
Extra-cellular
51
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
Cytosol
Plasma membrane
Extra-cellular
52
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
Cytosol
Plasma membrane
Extra-cellular
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
53
LocTree2: Prokaryota
Archaea (3 classes) Bacteria (6 classes)
SVM
SVM SVM
PM
PERI
OM
FIM EXT
SVM
SVM
SVM
SVM
CYT
SVM
CYT: cytosol, EXT: extra-cellular, FIM: fimbrium, OM: outer membrane, PERI: periplasmic space, PM: plasma membrane
SVM
CYT
PM EXT
SVM
54
LocTree2: Eukaryota (18 classes)
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
CHL: chloroplast, CYT: cytosol, ER: endoplasmic reticulum, EXT: extra-cellular, GOL: Golgi apparatus, MIT: mitochondria, NUC: nucleus, PER: peroxisome, PM: plasma membrane, PLAS: plastid, VAC: vacuole
s
s
s s
s s s
s
s s
s s s
s s s s
s
s
s s s s s s
s sz s s s
s s s s s s s s
55
LocTree2: Eukaryota (18 classes)
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
56
LocTree2: Eukaryota (18 classes)
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
SVM SVM
57
Secreted Non-secreted SVM
Non-membrane Trans-membrane
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Secreted Non-secreted
SVM
LocTree2: Eukaryota (18 classes)
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
SVM SVM
58
MultiClass Classifier Selection
• Averaged overall accuracies over five training sets are reported
No difference in accuracy between ND-based approaches
Selected ENDs (ensemble of random NDs) and MyND (ND of a fixed structure) for the next step
One-against-All
ENDs ECBNDs EDBNDs MyND
Archaea 2 classes
97 ± 3 97 ± 3 97 ± 3 97 ± 3 97 ± 3
Bacteria 6 classes
93 ± 2 96 ± 2 96 ± 2 96 ± 2 96 ± 2
Eukaryota 10 classes 81 ± 1 86 ± 1 86 ± 1 86 ± 1 86 ± 1
59
MultiClass Classifier Selection: SVMs Parameter Optimization
• Increased performance by 1%
• No difference in accuracy between END and MyND
Runtime END MyND
Archaea 3 classes 6.8 · 103 40 · 103
Bacteria 6 classes 2 · 103 15 · 103
Eukaryota 10 classes 0.2 · 103 6.1 · 103
MyND significantly faster: up to 30 times more sequences classified per minute.
Precomputed PSI-BLAST profiles and kernel values!
60
Novel Method (ab initio)
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
61
Eukaryota: Cross-validated Q18=65±2%
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Q =#Correctly Predicted
#Proteins
s
s
s s
s s s
s
s s
s
s s
s s s s
s
s
s
s
s s s s s s
z s s s
s s s s s s s s
62
Q =#Correctly Predicted
#Proteins
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
Eukaryota: Cross-validated Q2=94±2%
63
L Käll, A Krogh and E Sonnhammer (2005) Bioinformatics
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
PolyPhobius=95±2%
Eukaryota: Cross-validated Q2=94±2%
64
Eukaryota: Cross-validation
Accuracy = observed predicted
TP Coverage =
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
80±4 91±3
42±33 29±24
44±13 29±9
38±48 27±30
45±8 44±8
44±15 42±14
45±10 46±10
60±15 44±13
67±6 77±6
68±22 48±19
50±50 21±28
65
Eukaryota: Cross-validation
Accuracy = observed predicted
TP Coverage =
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
80±4 91±3
42±33 29±24
44±13 29±9
38±48 27±30
45±8 44±8
44±15 42±14
45±10 46±10
60±15 44±13
67±6 77±6
68±22 48±19
50±50 21±28
66
LocTree2‘s Novelty
Archaea (3)
Bacteria (6)
Eukaryota (18)
67
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
68
CELLO V.2.5
• Two-level system of SVMs
• Predicts 5 localization classes for bacteria and 12 for eukaryota
• Bacteria: data set derived from Swiss-Prot release 40.29, not homology reduced Eukaryota: data set derived from Swiss-Prot release 39, homology reduced
• http://cello.life.nctu.edu.tw/
CS Yu, JK Hwang, et al. (2006) Proteins 69
LOCtree
© http://rostlab.org/ • Hierarchical system of SVMs
• Predicts 3 localization classes for prokaryotes, 6 for plants and 5 for animals
• Data set derived from Swiss-Prot release 40, homology reduced
• http://www.predictprotein.org/
R Nair and B Rost (2005) J. Mol.Biol. 70
MultiLoc2
• Two – level system of SVMs
• Predicts 10 localization classes for animals and fungi, 11 classes for plants
• Data set derived from Swiss-Prot release 42, homology reduced
• http://abi.inf.uni-tuebingen.de/Publications/MultiLoc2
T Blum, O Kohlbacher, et al. (2009) BMC Bioinformatics 71
WoLF PSORT
• k-NN classifier, extension of PSORT II
• Predicts 12 localization classes for eukaryotic proteins
• Data set derived from Swiss-Prot release 45 and not homology reduced + Arabidopsis thaliana entries from the GO website
• http://wolfpsort.org/
P Horton, K Nakai, et al. (2007) NAR
©http://wolfpsort.org
72
LocTree2 Competitive for New Proteins
* NY Yu et al. (2010) Bioinformatics ° R Nair and Rost (2005) JoMB ** P Horton et al. (2007) NAR
New Swiss-Prot : Bacteria
Q5:
LocTree2 PSORTb 3.0*
86 ± 16 71 ± 21
LocTree2 LocTree°
86 ± 18 72 ± 21 Q3:
New Swiss-Prot : Eukaryota
Q9:
LocTree2 WoLF PSORT**
65 ± 14 62 ± 14
LocTree2 LocTree
66 ± 15 62 ± 17 Q5:
73
Best Even for Sequencing Mistakes
Sequence-unique eukaryotic proteins from Swiss-Prot and LocDB databases
Full length 30N removed 30C removed 1/3 randomly
removed
62 ± 2 54 ± 2 60 ± 2 53 ± 2
56 ± 2 35 ± 2 47 ± 2 48 ± 2
56 ± 2 40 ± 2 52 ± 2 49 ± 2
LocTree 2
CELLO v. 2.5
WoLF PSORT
S Rastogi and B Rost (2011) NAR 74
Conclusion
LocTree2 is a method for subcellular localization of proteins that • is available at http://www.rostlab.org/services/loctree2 • has high performnace accurate in distinguishing membrane/water-soluble proteins
• is robust against sequencing errors suitable for large-scale experiments
• aids the prediction of protein function
• is fast
75
Many Thanks to
Burkhard Rost
Tobias Hamp
RostLab@TUM 76
One more thing…
Experimental Annotation of Organisms: Related Species Differ in Localization?
Human (H. sapiens) Orangutan (P. abelii)
PM
CYT
PER EXT
VAC
Source: SWISS-PROT
Chicken (G. gallus)
Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)
NUC
NUC
NUC NUC
NUC
ER GOL
MIT
CHL
NUC
78
LocTree2 Annotation of Organisms: Related Species Are Similar in Localization
Human (H. sapiens) Orangutan (P. abelii) Chicken (G. gallus)
Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)
Source: SWISS-PROT, LocTree2
PM
CYT
ER
GOL
MIT
NUC
PER
EXT
VAC
79
Predicted Annotation of Organisms: Easy, Fast & High Coverage
Human (H. sapiens) Orangutan (P. abelii) Chicken (G. gallus)
Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)
Source: SWISS-PROT, CELLO v.2.5, WoLF PSORT, Yloc, LocTree2
PM
CYT
ER GOL
MIT
NUC
EXT
PER
VAC
80