
© Burkhard Rost 1

09.06.2016

title: TMSEG

Michael Bernhofer, Jonas Reeb

short title: pp1_tmseg

lecture: Protein Prediction 1 (for Computational Biology) – Protein structure

TUM summer semester


Last time…


More data available

• Re-training old methods is viable but no one does it

Less extensive machine learning

Runtime


Yet another transmembrane predictor?

166 membrane protein sequences (TMP166)
• TMH assignment from 3D structure by OPM & PDBTM

Assignments differ, both used for training

• Map to UniProt sequence using SIFTS

• Redundancy reduction with UniqueProt at HVAL>0

Dataset – Transmembrane helices I

Lomize et al., 2006, Bioinformatics; Velankar et al., 2013, NAR; Kozma et al., 2013, NAR; Mika et al., 2003, NAR

Inside/outside topology assignment from OPM

Dataset – Transmembrane helices II

Lomize et al., 2006, Bioinformatics

Derived from the SignalP 4.0 training set

Redundancy reduced against set of 166 TMPs at HVAL>0

Redundancy reduced within the set at HVAL>0

Dataset – Proteins w/ and w/o signal peptides

Soluble: 1142 (452 w/ SP)

Membrane: 299 (25 w/ SP)

SP1441

Split into 4 subsets, maintaining distribution of TMPs, SPs and sequence lengths

Use 3 sets for cross-validation, keep one for final independent evaluation (Blind set)

Dataset – Split

[Figure: dataset split. TMP166: Train and Blind (41). SP1441: Train and Blind (285).]

Given 𝑁 training samples and 𝑀 input features, find the best recursive partitioning to predict the class labels in the leaf nodes

Splitting, pruning, balancing ... approaches differentiate algorithms

Classification trees


Classification trees example

Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

Ensemble method: grow 𝑇 trees for a forest

For 𝑀 input features, choose 𝑚 < 𝑀

For each tree 𝑡 = 1, …, 𝑇:
• Select 𝑁 training samples with replacement from all 𝑁 samples
• At every split, choose 𝑚 random features. Use the best split among those for building the tree

Random forests
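The two sources of randomness described on this slide (a bootstrap sample per tree, a random feature subset per split) can be sketched as follows. This is only an illustration: the helper names and the toy sizes are assumptions, and no actual tree is grown here.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw N sample indices with replacement from N samples (one bag per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def split_candidates(n_features, m, rng):
    """Pick m of the M feature indices (without repeats) for one split."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
bag = bootstrap_sample(10, rng)            # toy N = 10
features = split_candidates(50, 9, rng)    # toy M = 50, m = 9
```

Because the bag is drawn with replacement, some samples appear several times and others not at all, which is what makes the individual trees differ.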

Fast

No black box

Intuitive to interpret

Good performance

Random forests - Popularity

Jensen et al., 2011, Bioinformatics

Random Forest (𝑇 = 100, 𝑚 = 9)

Sliding window of 19 residues (𝑤 = 19)

3 scores for each residue (0-1000):
• Signal peptide
• Transmembrane helix
• Soluble

Scores scaled from 0.0..1.0 to 0..1000

TMSEG step 1 – Initial prediction
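A minimal sketch of the windowing and score scaling on this slide. How TMSEG treats windows that run past the termini is not stated here, so the padding symbol is an assumption, as are the function names.

```python
def residue_windows(seq, w=19, pad="-"):
    """Return one window of w residues centred on each residue of seq.

    Positions beyond the termini are filled with `pad` (an assumption).
    """
    half = w // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + w] for i in range(len(seq))]


def scale_score(p):
    """Map a score in 0.0..1.0 to the integer range 0..1000."""
    return round(p * 1000)


windows = residue_windows("MGPRARPALLLL", w=5)  # short w for readability
```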


TMSEG overview – Step 1

Global features:
• Global amino acid composition
• Protein length

Local features:
• PSSM score
• Distance to N- and C-terminus
• Average hydrophobicity (Kyte-Doolittle)
• % hydrophobic
• % charged (positive & negative)
• % polar

TMSEG step 1 - Feature set I

𝑤 = 9
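As an illustration of the local features, the sketch below averages the Kyte-Doolittle hydropathy values and computes the percentage of charged residues in a window around one position. It assumes that the 𝑤 = 9 shown on this slide is the window for these features, that windows are truncated at the termini, and that K/R count as positive and D/E as negative; none of this is confirmed by the slides.

```python
# Kyte-Doolittle hydropathy scale
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")  # assumption: residue classes are not given on the slide
NEGATIVE = set("DE")  # assumption


def local_window(seq, i, w=9):
    """Residues within w // 2 positions of index i, truncated at the termini."""
    half = w // 2
    return seq[max(0, i - half): i + half + 1]


def window_features(seq, i, w=9):
    """Average hydrophobicity and % charged residues around position i."""
    win = local_window(seq, i, w)
    n = len(win)
    return {
        "kd_mean": sum(KD[a] for a in win) / n,
        "pct_positive": 100 * sum(a in POSITIVE for a in win) / n,
        "pct_negative": 100 * sum(a in NEGATIVE for a in win) / n,
    }
```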

TMSEG step 1 - Feature set II
Adjusting for conservation

…

Substitutions with score > 0 = 16
Substitutions with score < 0 = 79

TMSEG step 1 - Feature set III
Adjusting for conservation

…

Amino acid composition M (PSSM>0) = 1/16

TMSEG step 1 - Feature set IV
Adjusting for conservation

…

Amino acid composition M (PSSM>0) = 1/16
Amino acid composition M (PSSM<0) = 3/79

TMSEG step 1 - Feature set V
Adjusting for conservation

…

% positive charge (PSSM>0) = 2/16
% positive charge (PSSM<0) = 8/79
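The counts on these slides suggest that composition features are computed twice: once over the PSSM cells with a positive score and once over the cells with a negative score. A minimal sketch under that reading follows; the data layout (one dict per residue position) and the function name are assumptions.

```python
def composition_by_sign(pssm_rows, amino_acids):
    """Fraction of positive (and negative) PSSM cells that fall on each amino acid.

    pssm_rows: one dict per residue position, mapping amino acid -> PSSM score.
    """
    pos = {a: 0 for a in amino_acids}
    neg = {a: 0 for a in amino_acids}
    for row in pssm_rows:
        for a in amino_acids:
            if row[a] > 0:
                pos[a] += 1
            elif row[a] < 0:
                neg[a] += 1
    n_pos = sum(pos.values())
    n_neg = sum(neg.values())
    comp_pos = {a: pos[a] / n_pos if n_pos else 0.0 for a in amino_acids}
    comp_neg = {a: neg[a] / n_neg if n_neg else 0.0 for a in amino_acids}
    return comp_pos, comp_neg
```

With the numbers from the slides, a single positive cell for M among 16 positive cells gives 1/16, and three negative cells for M among 79 negative cells give 3/79.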

Global features:
• Global amino acid composition (PSSM≶0): 2*20
• Protein length (binned): 1

Local features:
• PSSM score: 21*19
• Distance to N- and C-terminus: 2
• Average hydrophobicity (Kyte-Doolittle) (PSSM≶0): 2*1
• % hydrophobic (PSSM≶0): 2*1
• % charged (positive & negative) (PSSM≶0): 2*2
• % polar (PSSM≶0): 2*1

TMSEG step 1 - Feature set VI
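If the per-feature counts listed on this slide are simply added up, the step 1 input vector has 452 entries. The short check below only does that arithmetic; the dictionary keys are descriptive labels, not TMSEG identifiers.

```python
# Feature counts as listed on the slide
FEATURE_COUNTS = {
    "global amino acid composition": 2 * 20,
    "protein length (binned)": 1,
    "PSSM score": 21 * 19,
    "distance to N- and C-terminus": 2,
    "average hydrophobicity": 2 * 1,
    "% hydrophobic": 2 * 1,
    "% charged (positive & negative)": 2 * 2,
    "% polar": 2 * 1,
}

total_features = sum(FEATURE_COUNTS.values())
```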

Smooth scores with median filter (𝑤 = 5)

Adjust scores to avoid overprediction
• soluble: -185
• TMH: -60

Assign each residue to state with highest score

Remove signal peptides with <4 residues

Remove TMHs with <7 residues

TMSEG step 2 – Empirical filter

TMSEG step 2 – Example

SEQ: M G P R A R P A L L L L ...

SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...
SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...
TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...

→ Median filter

SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 500 500 500 400 100 100 100 100 100 100 ...
TMH: 100 200 200 300 300 200 200 300 300 300 ...

→ Adjust for overprediction

SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...
TMH: 040 140 140 240 240 140 140 240 240 240 ...

OUT: S S S S S S S S S S ...
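The first ten positions of the example above can be reproduced with a few lines of Python. The sketch assumes that the median filter repeats the first and last score at the sequence ends; the slides do not state the edge handling, but this choice matches the values shown. Tie-breaking between states and the removal of short segments are not covered.

```python
from statistics import median

# Score adjustments from the slide; the signal peptide score is left unchanged
OFFSETS = {"SIG": 0, "SOL": -185, "TMH": -60}


def median_filter(scores, w=5):
    """Median over a window of w values; the ends are padded by repeating
    the first/last score (an assumption)."""
    half = w // 2
    last = len(scores) - 1
    return [
        median(scores[min(max(j, 0), last)] for j in range(i - half, i + half + 1))
        for i in range(len(scores))
    ]


def empirical_filter(raw, w=5):
    """Smooth each score track, apply the offsets, pick the best state per residue."""
    adjusted = {
        state: [v + OFFSETS[state] for v in median_filter(values, w)]
        for state, values in raw.items()
    }
    n = len(next(iter(adjusted.values())))
    states = [max(adjusted, key=lambda s: adjusted[s][i]) for i in range(n)]
    return adjusted, states


raw = {
    "SIG": [400, 400, 100, 100, 800, 600, 700, 900, 100, 600, 100, 800],
    "SOL": [500, 400, 600, 500, 100, 100, 100, 0, 500, 100, 100, 200],
    "TMH": [100, 200, 300, 400, 100, 300, 200, 100, 400, 300, 800, 0],
}
adjusted, states = empirical_filter(raw)
```

Only the first ten positions are comparable with the slide, because the last two windows depend on scores hidden behind the "..." in the example.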

TMSEG overview – Step 1 & 2

TMSEG step 3 – Refine TMH prediction I

Neural Network (25 hidden nodes)

Input: TMH segments of variable length

Features:
• Amino acid composition (PSSM≶0): 2*20
• Average hydrophobicity (Kyte-Doolittle) (PSSM≶0): 2*1
• % hydrophobic (PSSM≶0): 2*1
• % charged (PSSM≶0): 2*1
• Segment length (exact): 1

Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues)
• Keep two TMHs if higher average score after split

Adjust TMH endpoints by up to 3 residues in either direction

TMSEG step 3 – Refine TMH prediction II
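The refinement rules on this slide can be sketched as a search over candidate segments. In the sketch, `score_fn` is a hypothetical stand-in for the neural network score of a segment, segments are half-open `(start, end)` index pairs, and the two halves of a split are assumed to be directly adjacent; the slides do not specify these details.

```python
def candidate_splits(start, end, long_len=35, min_len=17):
    """All ways to cut a long TMH into two parts of at least min_len residues."""
    if end - start < long_len:
        return []
    return [((start, cut), (cut, end))
            for cut in range(start + min_len, end - min_len + 1)]


def split_if_better(start, end, score_fn):
    """Keep the split only if the two halves score higher on average."""
    best, best_score = [(start, end)], score_fn(start, end)
    for left, right in candidate_splits(start, end):
        avg = (score_fn(*left) + score_fn(*right)) / 2
        if avg > best_score:
            best, best_score = [left, right], avg
    return best


def endpoint_variants(start, end, max_shift=3):
    """Segments with each endpoint moved by up to max_shift residues."""
    shifts = range(-max_shift, max_shift + 1)
    return [(start + ds, end + de) for ds in shifts for de in shifts]
```

A caller would score every variant from `endpoint_variants` with the same `score_fn` and keep the best one, after clipping the variants to the sequence boundaries.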

TMSEG overview – Step 1-3

TMSEG step 4 – Topology prediction I

Random Forest (𝑇 = 100, 𝑚 = 7)

Assign soluble segments to side 1 or 2

Features:
• Amino acid composition (PSSM≶0): 2*2*20
• % positive charge (PSSM≶0): 2*2*1
• % abs. difference of pos. charge side 1/side 2 (PSSM≶0): 2*1

Consider only residues close to TMHs
• 15 residues next to TMHs and 8 residues into TMHs

Predict topology of N-terminus and extrapolate

If SP predicted → residues after SP “outside”

TMSEG step 4 – Topology prediction II
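Extrapolating from the N-terminus means that every TMH flips the side of the membrane. A minimal sketch, with hypothetical names and the labels "inside"/"outside" for the two sides:

```python
def extrapolate_topology(n_term_side, n_tmh, has_signal_peptide=False):
    """Return the side of each soluble segment, from N- to C-terminus.

    A predicted signal peptide forces the first segment after it to "outside".
    """
    if has_signal_peptide:
        n_term_side = "outside"
    flip = {"inside": "outside", "outside": "inside"}
    sides = [n_term_side]
    for _ in range(n_tmh):
        sides.append(flip[sides[-1]])  # crossing a TMH switches sides
    return sides
```

For a protein with three TMHs and an N-terminus predicted inside, the soluble segments are inside, outside, inside, outside.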


TMSEG overview – Step 1-4

Per-residue measures often misleading
→ Score by TMH segments instead

Whole-protein scores: Qok and Qtop

Performance measures I

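$r_i$ (recall for protein $i$): $r_i = 100 \cdot \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{observed TMHs}}$

$p_i$ (precision for protein $i$): $p_i = 100 \cdot \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{predicted TMHs}}$

Qok: $Q_{ok} = \frac{100}{N} \sum_{i=1}^{N} x_i; \quad x_i = \begin{cases} 1, & \text{if } p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$

Performance measures II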

What is a correctly predicted TMH?

Strict criteria
• Endpoint deviation ≤5 residues
• Overlap at least 50%

Performance measures III
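The strict criteria can be written as a small predicate. The slide does not say what the 50% overlap refers to; the sketch assumes it is measured against the longer of the two helices, and it uses inclusive `(first, last)` residue positions.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (first, last) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_correct_tmh(pred, obs, max_dev=5, min_overlap=0.5):
    """Both endpoints within max_dev residues and enough overlap."""
    endpoints_ok = (abs(pred[0] - obs[0]) <= max_dev
                    and abs(pred[1] - obs[1]) <= max_dev)
    longer = max(pred[1] - pred[0] + 1, obs[1] - obs[0] + 1)
    return endpoints_ok and overlap(pred, obs) >= min_overlap * longer
```

For very short segments the overlap criterion matters on its own: a predicted helix at 10-15 and an observed one at 14-19 pass the endpoint test but share only two residues.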


$t_i$: 100% if topology is correct, otherwise 0%

Qtop: $Q_{top} = \frac{100}{N} \sum_{i=1}^{N} y_i; \quad y_i = \begin{cases} 1, & \text{if } t_i = p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$

Performance measures IV
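Both whole-protein scores count the proteins for which everything is right. A minimal sketch, assuming each protein is summarised by hypothetical counts of correct, observed and predicted TMHs plus a topology flag, and that every protein has at least one observed TMH:

```python
def q_scores(proteins):
    """Return (Qok, Qtop) in percent for a list of per-protein summaries."""
    ok = top = 0
    for p in proteins:
        recall = 100 * p["correct"] / p["observed"]
        precision = 100 * p["correct"] / p["predicted"] if p["predicted"] else 0
        all_tmhs_right = recall == 100 and precision == 100
        ok += all_tmhs_right
        top += all_tmhs_right and p["topology_correct"]
    n = len(proteins)
    return 100 * ok / n, 100 * top / n


example = [
    {"correct": 3, "observed": 3, "predicted": 3, "topology_correct": True},
    {"correct": 3, "observed": 3, "predicted": 3, "topology_correct": False},
    {"correct": 2, "observed": 3, "predicted": 2, "topology_correct": True},
    {"correct": 2, "observed": 2, "predicted": 3, "topology_correct": True},
]
q_ok, q_top = q_scores(example)
```

In this toy set two of four proteins have all TMHs right and one of them also has the right topology, so Qok is 50 and Qtop is 25.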

Performance of TMH predictions

Performance measures – TMP classification

FPR: $100 \cdot \frac{\#\,\text{of incorrectly predicted TMPs}}{\#\,\text{of soluble proteins}}$

Sensitivity: $100 \cdot \frac{\#\,\text{of correctly predicted TMPs}}{\#\,\text{of observed TMPs}}$

Compare to a simple predictor (“Baseline”)
• Uses only hydrophobicity scale and positive-inside rule

Very low misclassification rates

TMP classification

Method        TMP sensitivity (%)   TMP FPR (%)   Topology correct (%)   Misclassified in human   More mistakes than TMSEG in human
TMSEG         98 ± 2                3 ± 1         93 ± 4                 558                      -
PolyPhobius   100 ± 0               5 ± 1         78 ± 7                 770                      212
MEMSAT3       100 ± 0               28 ± 2        93 ± 4                 4,313                    3,755
MEMSAT-SVM    98 ± 2                14 ± 2        88 ± 5                 2,253                    1,695
Baseline      95 ± 3                31 ± 2        75 ± 7                 5,015                    4,457

How to get more data?
• Use what was published since starting work
→ Data unknown by any method

From 07/2013 to 02/2016:
• Only 12 new TMPs published
• Very small dataset

TMSEG predicts every TMH of the 10 recognized TMPs

Dataset of 12 new proteins

High modularity (steps 1-4)

Apply steps 3 and 4 to other methods
• Step 3: NN-based TMH prediction improvement
• Step 4: RF-based topology prediction

Can this improve other methods?

Applying TMSEG to other methods I

Applying TMSEG to other methods II

Re-entrant regions not modelled (little data)

Idea: Check “abnormal” TMH segments for re-entrant regions
• Does not switching topology increase scores?

Potential extensions

Debian package: http://rostlab.org/debian/pool/main/t/tmseg/

GitHub: github.com/Rostlab/TMSEG

PredictProtein: predictprotein.org

Availability

Yachdav et al., 2014, NAR

Thank you

Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., … Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337–43. http://doi.org/10.1093/nar/gku366

Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331–3332. http://doi.org/10.1093/bioinformatics/btr585

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23. http://doi.org/10.1002/widm.8

Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5), 623–5. http://doi.org/10.1093/bioinformatics/btk023

Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262–5. http://doi.org/10.1093/nar/gki058

Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524–9. http://doi.org/10.1093/nar/gks1169

Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789–3791. http://doi.org/10.1093/nar/gkg620

References
