© Burkhard Rost 1
09.06.2016
title: TMSEG
Michael Bernhofer, Jonas Reeb
short title: pp1_tmseg
lecture: Protein Prediction 1 (for Computational Biology) – Protein structure
TUM summer semester
Last time…
Yet another transmembrane predictor?
More data available
• Re-training old methods is viable but no one does it
Less extensive machine learning
Runtime
Dataset – Transmembrane helices I
166 membrane protein sequences (TMP166)
• TMH assignment from 3D structure by OPM & PDBTM
  Assignments differ, both used for training
• Map to UniProt sequence using SIFTS
• Redundancy reduction with UniqueProt at HVAL>0
Lomize et al., 2006, Bioinformatics; Velankar et al., 2013, NAR; Kozma et al., 2013, NAR; Mika et al., 2003, NAR
Dataset – Transmembrane helices II
Inside/outside topology assignment from OPM
Lomize et al., 2006, Bioinformatics
Dataset – Proteins w/ and w/o signal peptides
Derived from the SignalP 4.0 training set
Redundancy reduced against the set of 166 TMPs at HVAL>0
Redundancy reduced within the set at HVAL>0
SP1441:
• Soluble: 1142 (452 w/ SP)
• Membrane: 299 (25 w/ SP)
Dataset – Split
Split into 4 subsets, maintaining distribution of TMPs, SPs and sequence lengths
Use 3 sets for cross-validation, keep one for final independent evaluation (Blind set)
[Figure: TMP166 and SP1441, each divided into Train and Blind parts; numeric labels in the figure: 41, 285]
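A minimal sketch of one way to build such a split, assuming a simple round-robin over proteins sorted by class and length. The function name `stratified_split`, the tuple layout and the toy data are hypothetical; the slides do not say how the actual split was produced.

```python
from collections import defaultdict

def stratified_split(proteins, n_subsets=4):
    """Deal proteins into n_subsets so that each subset gets a similar
    mix of classes and sequence lengths.

    proteins: list of (protein_id, label, length) tuples (hypothetical layout).
    """
    by_label = defaultdict(list)
    for prot in proteins:
        by_label[prot[1]].append(prot)

    subsets = [[] for _ in range(n_subsets)]
    for label in sorted(by_label):
        # Sorting by length puts similar lengths next to each other;
        # dealing them out round-robin spreads every length range evenly.
        ordered = sorted(by_label[label], key=lambda p: p[2])
        for i, prot in enumerate(ordered):
            subsets[i % n_subsets].append(prot)
    return subsets

# Toy data: 8 "TMP" and 16 "soluble" proteins with made-up lengths.
proteins = [(f"P{i}", "TMP" if i % 3 == 0 else "soluble", 100 + 7 * i)
            for i in range(24)]
subsets = stratified_split(proteins)
blind, cross_validation = subsets[0], subsets[1:]  # 1 blind set, 3 CV sets
```

Holding out one subset and rotating the other three mirrors the "3 sets for cross-validation, one blind set" layout described on the slide.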
Classification trees
Given 𝑁 training samples and 𝑀 input features, find the best recursive partitioning to predict the class labels in the leaf nodes
Algorithms differ in their approaches to splitting, pruning, balancing ...
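A small sketch of a single partitioning step, assuming Gini impurity as the splitting criterion (the criterion used by CART; other tree algorithms choose differently). The helper names `gini` and `best_split` and the toy data are illustrative only.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y, feature_ids=None):
    """Return (feature index, threshold, weighted Gini) of the best binary
    split 'feature <= threshold'. Returns (None, None, gini(y)) if no split
    improves on the unsplit node."""
    n = len(X)
    if feature_ids is None:
        feature_ids = range(len(X[0]))
    best = (None, None, gini(y))
    for j in feature_ids:
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, thr, score)
    return best

# Toy data: feature 0 separates the two classes perfectly at 2.5.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = ["sol", "sol", "tmh", "tmh"]
split = best_split(X, y)
```

A full tree applies this step recursively to the left and right partitions until a stopping rule is met.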
Classification trees example
Loh, 2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Random forests
Ensemble method: grow 𝑇 trees for a forest
For 𝑀 input features, choose 𝑚 < 𝑀
For each tree 𝑡 = 1, …, 𝑇:
• Select 𝑁 training samples with replacement from all 𝑁 samples
• At every split, choose 𝑚 random features. Use the best split among those for building the tree
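The two sources of randomness in a random forest can be sketched as follows. This only shows the sampling, not the tree growing; `bootstrap_indices`, `draw_split_features` and `plan_forest` are hypothetical helper names, and the sample and feature counts in the usage lines are made up.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw N sample indices with replacement from N samples."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def draw_split_features(n_features, m, rng):
    """Draw m distinct candidate features for one split (m < M)."""
    return sorted(rng.sample(range(n_features), m))

def plan_forest(n_samples, n_features, n_trees, m, seed=0):
    """For each tree, record its bootstrap sample and, as an illustration,
    the candidate features drawn for its first split. A real forest draws
    a fresh feature subset at every split of every tree."""
    rng = random.Random(seed)
    return [
        {
            "samples": bootstrap_indices(n_samples, rng),
            "root_split_features": draw_split_features(n_features, m, rng),
        }
        for _ in range(n_trees)
    ]

plan = plan_forest(n_samples=50, n_features=20, n_trees=100, m=9)
```

Because every tree sees a different bootstrap sample and different candidate features, the trees disagree in useful ways and their votes can be averaged.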
Random forests - Popularity
Fast
No black box
Intuitive to interpret
Good performance
Jensen et al., 2011, Bioinformatics
TMSEG step 1 – Initial prediction
Random Forest (𝑇 = 100, 𝑚 = 9)
Sliding window of 19 residues (𝑤 = 19)
3 scores for each residue (0-1000):
• Signal peptide
• Transmembrane helix
• Soluble
Scores scaled from 0.0..1.0 to 0..1000
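A minimal sketch of the window extraction and the score scaling. Padding the sequence ends with a placeholder character is an assumption, as are the names `sliding_windows` and `scale_scores`; the slides do not describe how TMSEG handles residues near the termini.

```python
def sliding_windows(sequence, w=19, pad="-"):
    """Return one window of w residues per position, centred on that residue.
    Positions beyond the sequence ends are filled with `pad` (assumption)."""
    half = w // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + w] for i in range(len(sequence))]

def scale_scores(probabilities):
    """Map scores from 0.0..1.0 to integers in 0..1000."""
    return [int(round(p * 1000)) for p in probabilities]

windows = sliding_windows("MGPRARPALLLL")
scaled = scale_scores([0.0, 0.4, 1.0])
```

Each window would be turned into a feature vector and scored by the forest, yielding three scaled scores (signal peptide, transmembrane helix, soluble) for its central residue.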
TMSEG overview – Step 1
TMSEG step 1 - Feature set I
Global features:
• Global amino acid composition
• Protein length
Local features:
• PSSM score
• Distance to N- and C-terminus
• Average hydrophobicity (Kyte-Doolittle)
• % hydrophobic
• % charged (positive & negative)
• % polar
𝑤 = 9
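A sketch of a few of the local features listed above, computed directly from the sequence. Assumptions: the 𝑤 = 9 label is read as the window size for these features, K/R count as positive and D/E as negative, and the window is simply truncated at the termini. The function name `local_features` and the feature keys are hypothetical. The hydrophobicity values are the Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")  # assumption: histidine not counted as charged
NEGATIVE = set("DE")

def local_features(sequence, center, w=9):
    """Features for the window of w residues centred on index `center`.
    The window is truncated at the sequence ends (assumption)."""
    half = w // 2
    window = sequence[max(0, center - half):center + half + 1]
    n = len(window)
    return {
        "avg_hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in window) / n,
        "pct_positive": 100.0 * sum(aa in POSITIVE for aa in window) / n,
        "pct_negative": 100.0 * sum(aa in NEGATIVE for aa in window) / n,
        "dist_n_term": center,
        "dist_c_term": len(sequence) - 1 - center,
    }

feats = local_features("MGPRARPALLLL", center=5)
```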
TMSEG step 1 - Feature set II
Adjusting for conservation
…
Substitutions with score > 0 = 16
Substitutions with score < 0 = 79
TMSEG step 1 - Feature set III
Adjusting for conservation
…
Amino acid composition M (PSSM>0) = 1/16
TMSEG step 1 - Feature set IV
Adjusting for conservation
…
Amino acid composition M (PSSM>0) = 1/16
Amino acid composition M (PSSM<0) = 3/79
TMSEG step 1 - Feature set V
Adjusting for conservation
…
% positive charge (PSSM>0) = 2/16
% positive charge (PSSM<0) = 8/79
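One reading of these slides is that each composition feature is computed twice: once over all substitutions with a positive PSSM score and once over all with a negative score. The sketch below follows that reading; `split_composition` and the toy PSSM rows are hypothetical and far smaller than the example on the slides.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def split_composition(pssm_rows):
    """pssm_rows: one dict per window position, mapping amino acid -> score.
    Returns (composition among scores > 0, composition among scores < 0,
    number of positive scores, number of negative scores)."""
    pos_counts = {aa: 0 for aa in AMINO_ACIDS}
    neg_counts = {aa: 0 for aa in AMINO_ACIDS}
    for row in pssm_rows:
        for aa in AMINO_ACIDS:
            if row[aa] > 0:
                pos_counts[aa] += 1
            elif row[aa] < 0:
                neg_counts[aa] += 1  # scores of exactly 0 are ignored
    n_pos = sum(pos_counts.values())
    n_neg = sum(neg_counts.values())
    pos = {aa: (c / n_pos if n_pos else 0.0) for aa, c in pos_counts.items()}
    neg = {aa: (c / n_neg if n_neg else 0.0) for aa, c in neg_counts.items()}
    return pos, neg, n_pos, n_neg

# Toy PSSM with two positions (values are made up).
row1 = {aa: -1 for aa in AMINO_ACIDS}
row1.update({"M": 5, "L": 2, "A": 0})
row2 = {aa: -2 for aa in AMINO_ACIDS}
row2["L"] = 4
pos, neg, n_pos, n_neg = split_composition([row1, row2])
```

In this toy case M accounts for 1 of 3 positive substitutions and 1 of 36 negative ones, the same kind of ratio as the 1/16 and 3/79 shown on the slides.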
TMSEG step 1 - Feature set VI
Global features:
• Global amino acid composition 2*20 (PSSM≶0)
• Protein length (binned) 1
Local features:
• PSSM score 21*19
• Distance to N- and C-terminus 2
• Average hydrophobicity (Kyte-Doolittle) 2*1 (PSSM≶0)
• % hydrophobic 2*1 (PSSM≶0)
• % charged (positive & negative) 2*2 (PSSM≶0)
• % polar 2*1 (PSSM≶0)
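Taking the per-feature counts on this slide at face value, the size of the input vector can be added up as below. The total is plain arithmetic on the slide's numbers, not a figure stated in the lecture, and the dictionary keys are just labels.

```python
# Feature counts as printed on the slide (2* = computed for PSSM>0 and PSSM<0).
feature_counts = {
    "global amino acid composition": 2 * 20,
    "protein length (binned)": 1,
    "PSSM score": 21 * 19,
    "distance to N- and C-terminus": 2,
    "average hydrophobicity": 2 * 1,
    "% hydrophobic": 2 * 1,
    "% charged (positive & negative)": 2 * 2,
    "% polar": 2 * 1,
}
total_features = sum(feature_counts.values())
print(total_features)
```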
TMSEG step 2 – Empirical filter
Smooth scores with median filter (𝑤 = 5)
Adjust scores to avoid overprediction
• soluble: -185
• TMH: -60
Assign each residue to state with highest score
Remove signal peptides with <4 residues
Remove TMHs with <7 residues
TMSEG step 2 – Example
SEQ:   M   G   P   R   A   R   P   A   L   L   L   L ...
SIG: 400 400 100 100 800 600 700 900 100 600 100 800 ...
SOL: 500 400 600 500 100 100 100 000 500 100 100 200 ...
TMH: 100 200 300 400 100 300 200 100 400 300 800 000 ...
→ Median filter
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 500 500 500 400 100 100 100 100 100 100 ...
TMH: 100 200 200 300 300 200 200 300 300 300 ...
→ Adjust for overprediction
SIG: 400 400 400 400 600 700 700 600 600 600 ...
SOL: 315 315 315 215 -85 -85 -85 -85 -85 -85 ...
TMH: 040 140 140 240 240 140 140 240 240 240 ...
OUT:   S   S   S   S   S   S   S   S   S   S ...
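The filter steps can be sketched as below, using the raw scores from the example. Two points are assumptions rather than facts from the slides: the window is padded at the sequence ends by repeating the first and last score (this happens to reproduce the first ten smoothed values of the example), and segments that are too short are relabelled as soluble. All function names are hypothetical.

```python
from itertools import groupby
from statistics import median

OFFSETS = {"SIG": 0, "SOL": -185, "TMH": -60}  # SIG is left unadjusted
MIN_LENGTH = {"SIG": 4, "TMH": 7}              # shorter segments are removed

def median_filter(scores, w=5):
    """Median over w residues; ends padded by repeating the edge value
    (assumption, not stated on the slides)."""
    half = w // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [median(padded[i:i + w]) for i in range(len(scores))]

def adjust(scores_by_state):
    """Apply the per-state offsets that counter overprediction."""
    return {state: [v + OFFSETS[state] for v in values]
            for state, values in scores_by_state.items()}

def assign_states(scores_by_state):
    """Give each residue the state with the highest score."""
    n = len(next(iter(scores_by_state.values())))
    return [max(scores_by_state, key=lambda s: scores_by_state[s][i])
            for i in range(n)]

def remove_short_segments(states, replacement="SOL"):
    """Relabel too-short SIG/TMH runs (relabelling to SOL is an assumption)."""
    out = []
    for state, run in groupby(states):
        length = len(list(run))
        keep = length >= MIN_LENGTH.get(state, 0)
        out.extend([state if keep else replacement] * length)
    return out

raw = {
    "SIG": [400, 400, 100, 100, 800, 600, 700, 900, 100, 600, 100, 800],
    "SOL": [500, 400, 600, 500, 100, 100, 100, 0, 500, 100, 100, 200],
    "TMH": [100, 200, 300, 400, 100, 300, 200, 100, 400, 300, 800, 0],
}
smoothed = {state: median_filter(values) for state, values in raw.items()}
adjusted = adjust(smoothed)
states = remove_short_segments(assign_states(adjusted))
```

Only the first ten positions are comparable with the slide, because the last two depend on scores that the example elides.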
TMSEG overview – Step 1 & 2
TMSEG step 3 – Refine TMH prediction I
Neural Network (25 hidden nodes)
Input: TMH segments of variable length
Features:
• Amino acid composition 2*20 (PSSM≶0)
• Average hydrophobicity (Kyte-Doolittle) 2*1 (PSSM≶0)
• % hydrophobic 2*1 (PSSM≶0)
• % charged 2*1 (PSSM≶0)
• Segment length (exact) 1
TMSEG step 3 – Refine TMH prediction II
Split long TMHs (≥35 residues) into two shorter TMHs (≥17 residues)
• Keep two TMHs if higher average score after split
Adjust TMH endpoints by up to 3 residues in either direction
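A sketch of the two refinement rules. The segment score would come from the neural network; here `toy_score` is a made-up stand-in that simply prefers segments of about 20 residues. Splitting without a gap between the two halves, and shifting both endpoints jointly, are assumptions. Segments are half-open index ranges `(start, end)`.

```python
def split_long_tmh(start, end, score_segment, min_total=35, min_part=17):
    """Try every cut that leaves two parts of at least min_part residues.
    Keep the split only if its average score beats the unsplit segment."""
    if end - start < min_total:
        return [(start, end)]
    best, best_score = [(start, end)], score_segment(start, end)
    for cut in range(start + min_part, end - min_part + 1):
        avg = (score_segment(start, cut) + score_segment(cut, end)) / 2
        if avg > best_score:
            best, best_score = [(start, cut), (cut, end)], avg
    return best

def adjust_endpoints(start, end, score_segment, seq_len, max_shift=3):
    """Shift each endpoint by up to max_shift residues in either direction
    and return the best-scoring candidate inside the sequence."""
    shifts = range(-max_shift, max_shift + 1)
    candidates = [(start + ds, end + de) for ds in shifts for de in shifts
                  if 0 <= start + ds < end + de <= seq_len]
    return max(candidates, key=lambda seg: score_segment(*seg))

def toy_score(start, end):
    """Hypothetical stand-in for the network score: best at 20 residues."""
    return 1 - abs((end - start) - 20) / 20

halves = split_long_tmh(0, 40, toy_score)
unchanged = split_long_tmh(0, 30, toy_score)
shifted = adjust_endpoints(5, 22, toy_score, seq_len=100)
```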
TMSEG overview – Step 1-3
TMSEG step 4 – Topology prediction I
Random Forest (𝑇 = 100, 𝑚 = 7)
Assign soluble segments to side 1 or 2
Features:
• Amino acid composition 2*2*20 (PSSM≶0)
• % positive charge 2*2*1 (PSSM≶0)
• % abs. difference of pos. charge side1/side2 2*1 (PSSM≶0)
TMSEG step 4 – Topology prediction II
Consider only residues close to TMHs
• 15 residues next to TMHs and 8 residues into TMHs
Predict topology of N-terminus and extrapolate
If SP predicted → Residues after SP “outside”
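Extrapolating from the N-terminus can be sketched as follows, assuming every predicted TMH crosses the membrane once so that the side flips after each helix. The function name `extrapolate_topology` and the side labels are hypothetical.

```python
def extrapolate_topology(n_tmh, n_terminus_side, has_signal_peptide=False):
    """Return the side of each of the n_tmh + 1 non-membrane segments,
    from N- to C-terminus. If a signal peptide is predicted, the residues
    after it are placed outside regardless of the predicted side."""
    flip = {"inside": "outside", "outside": "inside"}
    side = "outside" if has_signal_peptide else n_terminus_side
    sides = [side]
    for _ in range(n_tmh):
        side = flip[side]  # each TMH crosses the membrane once
        sides.append(side)
    return sides

three_helices = extrapolate_topology(3, "inside")
with_sp = extrapolate_topology(2, "inside", has_signal_peptide=True)
```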
TMSEG overview – Step 1-4
Performance measures I
Per-residue measures often misleading
→ Score by TMH segments instead
Whole-protein scores: Qok and Qtop
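Performance measures II
For protein $i$ (both expressed as a percentage):
$$r_i = \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{observed TMHs}}, \qquad p_i = \frac{\#\,\text{correctly predicted TMHs}}{\#\,\text{predicted TMHs}}$$
Qok:
$$Q_{ok} = \frac{100}{N} \sum_{i=1}^{N} x_i; \quad x_i = \begin{cases} 1, & \text{if } p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$$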
Performance measures III
What is a correctly predicted TMH?
Strict criteria
• Endpoint deviation ≤5 residues
• Overlap at least 50%
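A sketch of how the two criteria could be checked for one predicted and one observed helix. Measuring the 50% overlap relative to the longer of the two helices is an assumption, as is the greedy one-to-one matching in `count_correct`. Helices are `(start, end)` pairs with inclusive residue indices.

```python
def is_correct_tmh(pred, obs, max_dev=5, min_overlap=0.5):
    """True if both endpoints deviate by at most max_dev residues and the
    overlap covers at least min_overlap of the longer helix (assumption)."""
    (p_start, p_end), (o_start, o_end) = pred, obs
    endpoints_ok = (abs(p_start - o_start) <= max_dev
                    and abs(p_end - o_end) <= max_dev)
    overlap = max(0, min(p_end, o_end) - max(p_start, o_start) + 1)
    longer = max(p_end - p_start + 1, o_end - o_start + 1)
    return endpoints_ok and overlap >= min_overlap * longer

def count_correct(predicted, observed):
    """Greedily match each observed helix to one unused predicted helix."""
    used = set()
    n_correct = 0
    for obs in observed:
        for j, pred in enumerate(predicted):
            if j not in used and is_correct_tmh(pred, obs):
                used.add(j)
                n_correct += 1
                break
    return n_correct

n_correct = count_correct(predicted=[(10, 30), (50, 70)],
                          observed=[(12, 33), (60, 85)])
```

In the toy call, the first helix matches (endpoints off by 2 and 3 residues) while the second fails the endpoint criterion.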
Performance measures IV
ti: 100% if topology is correct, otherwise 0%
Qtop:
$$Q_{top} = \frac{100}{N} \sum_{i=1}^{N} y_i; \quad y_i = \begin{cases} 1, & \text{if } t_i = p_i = r_i = 100\% \\ 0, & \text{else} \end{cases}$$
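Both whole-protein scores can be computed from per-protein counts, as in this sketch. The record layout and the name `q_scores` are hypothetical; the condition p_i = r_i = 100% is expressed as "correct = observed = predicted", which assumes every protein has at least one observed TMH.

```python
def q_scores(proteins):
    """Return (Qok, Qtop) in percent.

    proteins: list of dicts with keys n_correct, n_observed, n_predicted
    (TMH counts) and topology_correct (bool). Assumes n_observed > 0.
    """
    n = len(proteins)
    ok = top = 0
    for prot in proteins:
        # p_i = r_i = 100% means all observed TMHs were found and
        # no additional TMHs were predicted.
        all_tmhs = (prot["n_correct"] == prot["n_observed"]
                    == prot["n_predicted"])
        if all_tmhs:
            ok += 1
            if prot["topology_correct"]:
                top += 1
    return 100.0 * ok / n, 100.0 * top / n

proteins = [
    {"n_correct": 3, "n_observed": 3, "n_predicted": 3, "topology_correct": True},
    {"n_correct": 3, "n_observed": 3, "n_predicted": 3, "topology_correct": False},
    {"n_correct": 2, "n_observed": 3, "n_predicted": 2, "topology_correct": True},
    {"n_correct": 1, "n_observed": 1, "n_predicted": 2, "topology_correct": True},
]
q_ok, q_top = q_scores(proteins)
```

By construction Qtop can never exceed Qok, since a protein only counts towards Qtop if all of its TMHs are also correct.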
Performance of TMH predictions
Performance measures – TMP classification
FPR: $100 \cdot \frac{\#\,\text{incorrectly predicted TMPs}}{\#\,\text{soluble proteins}}$
Sensitivity: $100 \cdot \frac{\#\,\text{correctly predicted TMPs}}{\#\,\text{observed TMPs}}$
Compare to a simple predictor (“Baseline”)
• Uses only hydrophobicity scale and positive-inside rule
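The two classification measures follow directly from their definitions. In this sketch each protein is represented by a pair of booleans (observed TMP, predicted TMP); the name `tmp_classification` and the toy labels are hypothetical.

```python
def tmp_classification(observed, predicted):
    """Return (sensitivity, FPR) in percent.

    observed, predicted: parallel lists of booleans, True = membrane protein.
    """
    pairs = list(zip(observed, predicted))
    n_tmp = sum(1 for o, _ in pairs if o)
    n_soluble = len(pairs) - n_tmp
    true_pos = sum(1 for o, p in pairs if o and p)
    false_pos = sum(1 for o, p in pairs if not o and p)
    sensitivity = 100.0 * true_pos / n_tmp
    fpr = 100.0 * false_pos / n_soluble
    return sensitivity, fpr

# Toy data: 4 TMPs (3 found) and 10 soluble proteins (1 wrongly called TMP).
observed = [True] * 4 + [False] * 10
predicted = [True, True, True, False] + [True] + [False] * 9
sensitivity, fpr = tmp_classification(observed, predicted)
```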
TMP classification
Very low misclassification rates
Method        TMP sensitivity (%)   TMP FPR (%)   Topology correct (%)   Misclassified in human   More mistakes than TMSEG in human
TMSEG         98 ± 2                3 ± 1         93 ± 4                 558                      -
PolyPhobius   100 ± 0               5 ± 1         78 ± 7                 770                      212
MEMSAT3       100 ± 0               28 ± 2        93 ± 4                 4,313                    3,755
MEMSAT-SVM    98 ± 2                14 ± 2        88 ± 5                 2,253                    1,695
Baseline      95 ± 3                31 ± 2        75 ± 7                 5,015                    4,457
Dataset of 12 new proteins
How to get more data?
• Use what was published since starting work
→ Data unknown by any method
From 07/2013 to 02/2016:
• Only 12 new TMPs published
• Very small dataset
TMSEG predicts every TMH of the 10 recognized TMPs
Applying TMSEG to other methods I
High modularity (steps 1-4)
Apply steps 3 and 4 to other methods
• Step 3: NN-based TMH prediction improvement
• Step 4: RF-based topology prediction
Can this improve other methods?
Applying TMSEG to other methods II
Potential extensions
Re-entrant regions not modelled (little data)
Idea: Check “abnormal” TMH segments for re-entrant regions
• Does not switching topology increase scores?
Availability
Debian package: http://rostlab.org/debian/pool/main/t/tmseg/
GitHub: github.com/Rostlab/TMSEG
PredictProtein: predictprotein.org
Yachdav et al., 2014, NAR
Thank you
References
Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., … Rost, B. (2014). PredictProtein-an open resource for online prediction of protein structural and functional features. Nucleic Acids Research, 42(Web Server issue), W337–43. http://doi.org/10.1093/nar/gku366
Jensen, L. J., & Bateman, A. (2011). The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24), 3331–3332. http://doi.org/10.1093/bioinformatics/btr585
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14–23. http://doi.org/10.1002/widm.8
Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics (Oxford, England), 22(5), 623–5. http://doi.org/10.1093/bioinformatics/btk023
Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., & Henrick, K. (2005). E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), D262–5. http://doi.org/10.1093/nar/gki058
Kozma, D., Simon, I., & Tusnády, G. E. (2013). PDBTM: Protein Data Bank of transmembrane proteins after 8 years. Nucleic Acids Research, 41(Database issue), D524–9. http://doi.org/10.1093/nar/gks1169
Mika, S., & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31(13), 3789–3791. http://doi.org/10.1093/nar/gkg620