Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes

Tatyana Goldberg November 21, 2013

Subcellular Localization Development of a Prediction Tool

@

Eukaryotic Cell

2 K Rogers (2011) Britannica

Nucleus

Plasma membrane

Smooth Endoplasmic reticulum

Rough Endoplasmic reticulum

Mitochondrion

Cytoplasm

Peroxisome

Golgi apparatus

Protein Function: Examples

G Protein: Signaling &Transport

Across Cell Membranes

© http://www.pdb.org

Collagen: Structural Support

Crystalline: Capturing Light

ATP Synthase: Molecular Motor

Nucleosome: DNA Maintenance

Ribosome: Translation

3

Sequence to Function Gap

• UniProtKB: more than 48 Mio protein sequences (2013_11)

• UniProtKB/Swiss-Prot: 0.5 Mio manually annotated entries (2013_11)

Need for reliable automatic predictions of protein

function from amino acid sequence alone

UniProt Consortium (2011) Nucleic Acids Res A Bairoch and R Apweiler (1997) J. Mol. Med.

MTSHSYYKDRLGFDPNEQQPGSNNSMKRSSSRQTTHHHQSYHHATTSSSQSPARISVSPGGNNGTLEYQQVQRENNW…

Predictor Protein Function

4

Describing Protein Function

Gene Ontology (GO)

Molecular Function

Cellular Component

Biological Process

3461 terms

26116 terms

10543 terms

The Gene Ontology Consortium (2008) Nucleic Acids Res

GO: Cellular Component for HIST1H2AI 5

Stats: Nov 2013

Common compartment Common physiological function K Rogers (2011) Britannica

Cellular Compartmentalization

Prokaryotic Cell Eukaryotic Cell (bacillus type)

6

TM Bakheet and AJ Doig (2008) Bioinformatics

Subcellular Location in Drug Targets

Pero 1%

Microsome 2%

Other 2% Mito

2%

Nucleus 3%

ER 7%

Extra-cellular

13%

Cytoplasm 13%

Membrane 57%

Drug targets tend to be found in membranes, cytoplasm or are extra-cellular!

7

8 L Rajendran, K Simons et al. (2010) Nature Reviews

Mislocalizations Associated with Human Diseases

MC Hung and W Link (2012) JCS 9

Mislocalization of PKC

© http://www.jbc.org/content/279/6.cover-expansion D Chen, AC Newton, et al. (2004) JBC

=> Cytokinesis Failure

Control

10

Control

Microtubules Nuclei

http://www.jbc.org/content/279/6.cover-expansion



Why Subcellular Localization?

• Knowledge of subcellular localization of a protein provides information about its functional role

• Improved target identification during the drug discovery process

• Aberrant protein subcellular location has been observed in the cells of several diseases

• Can help validate or analyze protein protein interactions

11

Localization Predictions Indispensable

High-throughput methods

• cost money

• not accurate

• not complete

Intracellular mitochondrial net-

work, microtubules, nuclei

© http://www.microscopyu.com

SWISS-PROT: Saccharomyces cerevisiae

• ~6,600 manually annotated entries

• ~4,500 have localization information (Nov 2013) 12

Localization is ideal for Predictions

• Localization is an easily identifiable functional feature

• Process of protein trafficking is quite well understood

• Localization data available in public protein databases

• Most proteins active in only one compartment 13

Trafficking in the Cell

Key:

gated transport

transmembrane transport

vesicular transport

B Alberts, et al. (1994) The Cell 14

Sorting Signals

B Alberts et al. (1994) The Cell

• A majority of signal peptides are still unknown

• For signal patches the situation is even worse

15

Protein Trafficking via Sorting Signals

1999 Nobel Prize in Physiology/ Medicine given to Günter Blobel

„for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell“

=> „address tag“ or „zip code“

Nucleus Intracellular Space Eukaryotic Cell PKKKRKV

http://www.time.com

16

http://www.time.com/

17

Predicting Localization

1) Sorting signals (signalP, chloroP)

2) Homology-based inference (R Nair and B Rost, 2002)

3) Text-based analysis (LOCkey)

4) De novo (CELLO v.2.5)

5) Hybrid approaches (LOCtree, MultiLoc2, WoLF PSORT)

18

Problems to Consider

• Biological data is „noisy“ - at the biological, experimental and curation levels

• Prediction performance must be established properly • Sequence similarity between training and test data sets

• Sequence redundancy in public databases

• Prediction results must be interpretable by users • Benchmarking prediction methods can be a difficult task

19

New Method: LocTree2

Archaea

Bacteria

Eukaryota

20

Archaea

Bacteria

Eukaryota

21

New Method: LocTree2

• One framework for all domains of life

• Robustness to sequencing errors

• Predictions for the largest number of classes so far:

• 3 for Archaea • 6 for Bacteria • 18 for Eukaryota

• Membrane predictions

• Decision tree resembling cellular protein sorting

• Reliability score measuring the strength of a prediction

• High performance

Novel Method

Data Set Preparation

Training the Prediction Method

Archaea: 3 classes, 59 sequences

Bacteria: 6 classes, 479 sequences

Eukaryota: 18 classes, 1682 sequences

Testing

Comparison to External Tools

Kernel Selection

Multiclass Classifier Selection

SVMs Parameter Optimization

SVMs

5-f

old

cro

ss-v

alid

atio

n

10

-fo

ld c

ross

-val

idat

ion

22

Novel Method






Testing


Kernel Selection



SVMs

5-f

old

cro

ss-v

alid

atio

n

10

-fo

ld c

ross

-val

idat

ion

23

Data Set for Development

- Exclusion of proteins with non-experimental annotations and of proteins with unclear or multiple localizations

- Identification of transmembrane proteins by „Single-pass“ or „Multi-pass“ keywords in the CC lines

24

Homology Reduction

Bias in protein databases towards certain protein families Construction of sequence unique sets in two steps

1. HSSP-value ≤ 0

2. BLAST E-value > 10-3

0 100 200 300 400 500

0

20

40

60

80

100

Length of alignment

% id

enti

cal r

esid

ues

Below this threshold no reliable annotation of subcellular localization based on sequence homology

C Sander and R Schneider (1991) Proteins; Rost B. (1999) Protein Eng B Rost (2002) J Mol Biol SF Altschul, DJ Lipman, et al. (1990) J Mol Biol

25

Data Statistics

0

5000

10000

15000

20000

25000

30000

35000

Archaea

Bacteria

Eukaryota

SWISS-PROT annotated

(2011_04 release)

# p

rote

ins

HSSP-value ≤ 0 BLAST E-value > 10-3

Homology Reduction 26

Stratified k-fold Cross-Validation

K-fold Cross Testing: a random partitioning into k equally sized subsets, use each subset for testing exactly once and the remaining k-1 subsets for training

Stratified: each subset contains about the same proportion of class labels as the original data set

Performance!

27

Size Increase of the Training Sets

Test Set Training Set SWISS-PROT annotated

Test Set Training Set

HSSP-value ≤0

HSSP-value ≤60

HSSP-value ≤0 HSSP-value ≤60

Size increase by almost a factor of 4!

28

Performance Estimates

29

Performance Estimate: Accuracy

• Number of correctly predicted proteins that are observed to be in localization L divided by the total number of proteins predicted in L

• How often the predicted localization class is correct (also called „precision“ or „specificity“)

30

𝐴𝑐𝑐 𝐿 = 100 𝑇𝑃

𝑇𝑃 + 𝐹𝑃

Performance Estimate: Coverage

• Number of correctly predicted proteins that are observed to be in localization L divided by the total number of proteins observed in L

• How often the observed localization class is predicted correctly (also called „recall“ or „sensitivity“)

31

𝐶𝑜𝑣 𝐿 = 100 𝑇𝑃

𝑇𝑃 + 𝐹𝑁

Measure for the Precision of Estimates

• How reliable are the estimates?

• What is the standard error?

B Efron (1979) The Annals of Statistics

Bootstrapping

• “To pull oneself up by one’s bootstraps” (The Adventures of Baron Munchausen)

• Introduced by Bradley Efron in 1979

• Popularized in 1980s due to the introduction of computers in statistical practice

32

Bootstrapping: the Algorithm

• Start with your set of predictions

• Randomly draw a subset of n predictions from the original set

• Compute an estimate x for this subset of predictions

• Repeat previous two steps m times

bootstrap estimates x1, …, xm

33

Novel Method






Testing


Kernel Selection



SVMs

5-f

old

cro

ss-v

alid

atio

n

10

-fo

ld c

ross

-val

idat

ion

34

Linear SVM

- Linear separation by building a hyperplane

- Hyperplane with the maximum margin is the best

Support vector

Cortes C. and Vapnik V. Machine Learning (1995) 35

What if data is linearly non-separable?

Map data to a higher-dimensional feature space for a linear separation

Feature Map

Φ: x → φ(x)

Protein1

Protein2

Protein3

Protein5

Protein6

Protein7

Vector1

Vector8

Protein8

Protein9

Vector9

Vector2

Vector3

Vector4

Vector5

Vector6

A Kernel function performs this mapping and computes the distance between two vectors in the feature space.

36

String Kernel Selection

Idea: find similarity between two protein sequences by looking at the number of common k-mers

1. String Subsequence Kernel (H Lodhi, S Watkins, et al. 2002 JMLR)

• Text classification • 3 parameters

2. Mismatch Kernel (H Lodhi, S Watkins, et al. 2002 JMLR)

• Protein classification • 2 parameters

3. Profile Kernel (R Kuang, C Leslie, et al. 2004 Bioinformatics)

• Protein classification • 2 parameters and evolutionary profile

37

Profile Kernel: Example

Prot1 CATLGLLGLVAL Prot2 CARRGLLWAVAL

38 C Leslie, W Noble, et al. (2004) Bioinformatics


Profile1

A R T Q …

C 2 3 1 1 …

A 1 4 3 1 …

T 4 2 1 4 …

…

Prot1 CATLGLLGLVAL

e.g. k-mer length=3, conservation threshold=5

C Leslie, W Noble, et al. (2004) Bioinformatics 39


all possible 3-mers = 203

Φ(Prot1)=(0,…,0,…,0,…,0,…,0,…,0)

AAT CAT CTQ CAW

Profile1

A R T Q …

C 2 3 1 1 …

A 1 4 3 1 …

T 4 2 1 4 …

…

Prot1 CATLGLLGLVAL




Φ(Prot1)=(0,…,0,…,0,…,0,…,0,…,0)

AAT CAT CTQ CAW

Profile1

A R T Q …

C 2 3 1 1 …

A 1 4 3 1 …

T 4 2 1 4 …

…

Prot1 CATLGLLGLVAL


CAT

AAT CAW

CAT CQT …

… …

(2+1+1<5)

(1+1+1<5) (1+1+1<5)

(1+1+3=5)



Φ(Prot1)=(0,…,1,…,1,…,1,…,0,…,0)

Profile1

A R T Q …

C 2 3 1 1 …

A 1 4 3 1 …

T 4 2 1 4 …

…

Prot1 CATLGLLGLVAL


CAT

AAT CAW

CAT CQT …

… …

(2+1+1<5)

(1+1+1<5) (1+1+1<5)

(1+1+3=5)

AAT CAT CTQ CAW



Profile1

A R T Q …

C 2 3 1 1 …

A 1 4 3 1 …

T 4 2 1 4 …

…

Prot1 CATLGLLGLVAL


Φ(Prot2)=(0,…,0,…,1,…,1,…,0,…,0)

K(Prot1, Prot2)= Φ(Prot1)· Φ(Prot2) = 32

CAT

AAT CAW

CAT CQT …

…

(2+1+1<5)

(1+1+1<5) (1+1+1<5)

(1+1+3=5)

…

Φ(Prot1)=(0,…,1,…,1,…,1,…,0,…,0)

AAT CAT CTQ CAW


Kernel Selection

SSK Mismatch Kernel Profile Kernel

Archaea 2 classes

97 ± 3 97 ± 3 97 ± 3

Bacteria 6 classes

88 ± 2 88 ± 2 93 ± 2

Eukaryota 10 classes 69 ± 1 Not available 81 ± 1

• For each kernel a range of parameter combinations tested

• Averaged overall accuracies over five training sets are reported

• One-against-all multiclass classifier

The Profile Kernel is a clear winner 44

MultiClass Classifier Selection

1. One – Against – All (V Vapnik 1995 The Nature of Statistical Learning Theory)

2. Ensemble on Nested Dichotomies (S Kramer and E Frank 2004 Proceedings of ICML)

3. Ensemble of Class Balanced Dichotomies (E Frank, S Dong, et al. 2005 PKDD)

4. Ensemble of Data Balanced Dichotomies (E Frank, S Dong, et al. 2005 PKDD)

5. Nested Dichotomy of a fixed structure

45

MultiClass Classifier #1: One - Against -All

Classification of data with labels from >2 classes

• Train n binary classifiers, one for each class against all other classes

• Predicted class is the class of the classifier with the highest value

#1

1 2,…,n

#2

2 1,…,n

#n

n 1,…,n-1

…

V Vapnik (1995) The Nature of Statistical Learning Theory 46

MultiClass Classifier #2: ENDs

Decompose classes into a system of nested dichotomies (NDs).

J Fox (1997) Applied regression analysis, linear models, and related methods. Sage. S Kramer and E Frank (2004) Proceedings of ICML

• Every ND is equally probable! • Take an Ensemble of Nested Dichotomies (ENDs) • The result of END is the average of estimates of individual NDs

… …

47

MultiClass Classifier #3, 4: ECBNDs, EDBNDs

ECBNDs: Ensemble of Class Balanced NDs • Each internal node of a binary tree is class balanced

EDBNDs: Ensemble of Data Balanced NDs • Each internal node of a binary tree is data balanced

Limited number of possible NDs

E Frank, S Dong, et al. (2005) PKDD 48

MultiClass Classifier #5: MyND

Nested Dichotomy of a fixed architecture

49

LocTree2: Prokaryota

SVM

CYT

PM EXT

Archaea (3 classes)

SVM

CYT: cytosol, EXT: extra-cellular, PM: plasma membrane

50


SVM

CYT

PM EXT

Archaea (3 classes)

SVM


Plasma membrane

Cytosol

Plasma membrane

Extra-cellular

51


SVM

CYT

PM EXT

Archaea (3 classes)

SVM


Cytosol

Plasma membrane

Extra-cellular

52


SVM

CYT

PM EXT

Archaea (3 classes)

SVM

Cytosol

Plasma membrane

Extra-cellular


53


Archaea (3 classes) Bacteria (6 classes)

SVM

SVM SVM

PM

PERI

OM

FIM EXT

SVM

SVM

SVM

SVM

CYT

SVM

CYT: cytosol, EXT: extra-cellular, FIM: fimbrium, OM: outer membrane, PERI: periplasmic space, PM: plasma membrane

SVM

CYT

PM EXT

SVM

54

LocTree2: Eukaryota (18 classes)

SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

CHL: chloroplast, CYT: cytosol, ER: endoplasmic reticulum, EXT: extra-cellular, GOL: Golgi apparatus, MIT: mitochondria, NUC: nucleus, PER: peroxisome, PM: plasma membrane, PLAS: plastid, VAC: vacuole

s

s

s s

s s s

s

s s

s s s

s s s s

s

s

s s s s s s

s sz s s s

s s s s s s s s

55


SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

Non-membrane Trans-membrane

SVM SVM SVM SVM

s

s

s s

s

s s

s

s

s

s

s s

s s

s s

s

s

s

s

s s s s

s sz

s s s

s s s s s

s s s

56


SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM


SVM SVM

s

s

s s

s

s s

s

s

s

s

s s

s s

s s

s

s

s

s

s s s s

s sz

s s s

s s s s s

s s s

SVM SVM

57

Secreted Non-secreted SVM


SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

Secreted Non-secreted

SVM


s

s

s s

s

s s

s

s

s

s

s s

s s

s s

s

s

s

s

s s s s

s sz

s s s

s s s s s

s s s

SVM SVM

58

MultiClass Classifier Selection

• Averaged overall accuracies over five training sets are reported

No difference in accuracy between ND-based approaches

Selected ENDs (ensemble of random NDs) and MyND (ND of a fixed structure) for the next step

One-against-All

ENDs ECBNDs EDBNDs MyND

Archaea 2 classes

97 ± 3 97 ± 3 97 ± 3 97 ± 3 97 ± 3

Bacteria 6 classes

93 ± 2 96 ± 2 96 ± 2 96 ± 2 96 ± 2

Eukaryota 10 classes 81 ± 1 86 ± 1 86 ± 1 86 ± 1 86 ± 1

59

MultiClass Classifier Selection: SVMs Parameter Optimization

• Increased performance by 1%

• No difference in accuracy between END and MyND

Runtime END MyND

Archaea 3 classes 6.8 · 103 40 · 103

Bacteria 6 classes 2 · 103 15 · 103

Eukaryota 10 classes 0.2 · 103 6.1 · 103

MyND significantly faster: up to 30 times more sequences classified per minute.

Precomputed PSI-BLAST profiles and kernel values!

60

Novel Method (ab initio)






Testing


Kernel Selection



SVMs

5-f

old

cro

ss-v

alid

atio

n

10

-fo

ld c

ross

-val

idat

ion

61

Eukaryota: Cross-validated Q18=65±2%

SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

Q =#Correctly Predicted

#Proteins

s

s

s s

s s s

s

s s

s

s s

s s s s

s

s

s

s

s s s s s s

z s s s

s s s s s s s s

62

Q =#Correctly Predicted

#Proteins

SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM


SVM SVM SVM SVM

s

s

s s

s

s s

s

s

s

s

s s

s s

s s

s

s

s

s

s s s s

s sz

s s s

s s s s s

s s s


63

L Käll, A Krogh and E Sonnhammer (2005) Bioinformatics

SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM


SVM SVM SVM SVM

s

s

s s

s

s s

s

s

s

s

s s

s s

s s

s

s

s

s

s s s s

s sz

s s s

s s s s s

s s s

PolyPhobius=95±2%


64

Eukaryota: Cross-validation

Accuracy = observed predicted

TP Coverage =

SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

80±4 91±3

42±33 29±24

44±13 29±9

38±48 27±30

45±8 44±8

44±15 42±14

45±10 46±10

60±15 44±13

67±6 77±6

68±22 48±19

50±50 21±28

65

Eukaryota: Cross-validation

Accuracy = observed predicted

TP Coverage =

SVM

SVM

SVM SVM

SVM

ER

EXT GOL NUC CYT VAC

PER

MIT

CHL PLAS

PM GOL VAC

ER NUC

PER

MIT CHL

SVM SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

SVM

80±4 91±3

42±33 29±24

44±13 29±9

38±48 27±30

45±8 44±8

44±15 42±14

45±10 46±10

60±15 44±13

67±6 77±6

68±22 48±19

50±50 21±28

66

LocTree2‘s Novelty

Archaea (3)

Bacteria (6)

Eukaryota (18)

67

Novel Method






Testing


Kernel Selection



SVMs

5-f

old

cro

ss-v

alid

atio

n

10

-fo

ld c

ross

-val

idat

ion

68

CELLO V.2.5

• Two-level system of SVMs

• Predicts 5 localization classes for bacteria and 12 for eukaryota

• Bacteria: data set derived from Swiss-Prot release 40.29, not homology reduced Eukaryota: data set derived from Swiss-Prot release 39, homology reduced

• http://cello.life.nctu.edu.tw/

CS Yu, JK Hwang, et al. (2006) Proteins 69

http://cello.life.nctu.edu.tw/

http://cello.life.nctu.edu.tw/

LOCtree

© http://rostlab.org/ • Hierarchical system of SVMs

• Predicts 3 localization classes for prokaryotes, 6 for plants and 5 for animals

• Data set derived from Swiss-Prot release 40, homology reduced

• http://www.predictprotein.org/

R Nair and B Rost (2005) J. Mol.Biol. 70

http://rostlab.org/cms/uploads/media/pp2_20101208_localization2.pdf



http://www.predictprotein.org/

http://www.predictprotein.org/

MultiLoc2

• Two – level system of SVMs

• Predicts 10 localization classes for animals and fungi, 11 classes for plants

• Data set derived from Swiss-Prot release 42, homology reduced

• http://abi.inf.uni-tuebingen.de/Publications/MultiLoc2

T Blum, O Kohlbacher, et al. (2009) BMC Bioinformatics 71

http://abi.inf.uni-tuebingen.de/Publications/MultiLoc2




WoLF PSORT

• k-NN classifier, extension of PSORT II

• Predicts 12 localization classes for eukaryotic proteins

• Data set derived from Swiss-Prot release 45 and not homology reduced + Arabidopsis thaliana entries from the GO website

• http://wolfpsort.org/

P Horton, K Nakai, et al. (2007) NAR

©http://wolfpsort.org

72

http://wolfpsort.org/



LocTree2 Competitive for New Proteins

* NY Yu et al. (2010) Bioinformatics ° R Nair and Rost (2005) JoMB ** P Horton et al. (2007) NAR

New Swiss-Prot : Bacteria

Q5:

LocTree2 PSORTb 3.0*

86 ± 16 71 ± 21

LocTree2 LocTree°

86 ± 18 72 ± 21 Q3:

New Swiss-Prot : Eukaryota

Q9:

LocTree2 WoLF PSORT**

65 ± 14 62 ± 14

LocTree2 LocTree

66 ± 15 62 ± 17 Q5:

73

Best Even for Sequencing Mistakes

Sequence-unique eukaryotic proteins from Swiss-Prot and LocDB databases

Full length 30N removed 30C removed 1/3 randomly

removed

62 ± 2 54 ± 2 60 ± 2 53 ± 2

56 ± 2 35 ± 2 47 ± 2 48 ± 2

56 ± 2 40 ± 2 52 ± 2 49 ± 2

LocTree 2

CELLO v. 2.5

WoLF PSORT

S Rastogi and B Rost (2011) NAR 74

Conclusion

LocTree2 is a method for subcellular localization of proteins that • is available at http://www.rostlab.org/services/loctree2 • has high performnace accurate in distinguishing membrane/water-soluble proteins

• is robust against sequencing errors suitable for large-scale experiments

• aids the prediction of protein function

• is fast

75

http://www.rostlab.org/services/loctree2

Many Thanks to

Burkhard Rost

Tobias Hamp

RostLab@TUM 76

One more thing…

Experimental Annotation of Organisms: Related Species Differ in Localization?

Human (H. sapiens) Orangutan (P. abelii)

PM

CYT

PER EXT

VAC

Source: SWISS-PROT

Chicken (G. gallus)

Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)

NUC

NUC

NUC NUC

NUC

ER GOL

MIT

CHL

NUC

78

LocTree2 Annotation of Organisms: Related Species Are Similar in Localization

Human (H. sapiens) Orangutan (P. abelii) Chicken (G. gallus)


Source: SWISS-PROT, LocTree2

PM

CYT

ER

GOL

MIT

NUC

PER

EXT

VAC

79

Predicted Annotation of Organisms: Easy, Fast & High Coverage

Human (H. sapiens) Orangutan (P. abelii) Chicken (G. gallus)


Source: SWISS-PROT, CELLO v.2.5, WoLF PSORT, Yloc, LocTree2

PM

CYT

ER GOL

MIT

NUC

EXT

PER

VAC

80

Documents

Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes