TXTpred: A New Method for Protein Secondary Structure Prediction

Carnegie Mellon School of Computer Science, Biological Language Modeling

Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science
Carnegie Mellon University

May 14, 2003

Roadmap

• Overview of secondary structure prediction
• Description of the TXTpred method
• Experimental results and analysis
• Discussion and further work

Secondary Structure of a Protein Sequence

• The Dictionary of Secondary Structure of Proteins (DSSP) annotates each residue with its structure
  – Based on hydrogen bonding patterns and geometrical constraints
• 7 DSSP labels for protein secondary structure (PSS):
  – Helix types: H, G (alpha-helix, 3/10 helix)
  – Sheet types: B, E (isolated beta-bridge, strand)
  – Coil types: T, _, S (coil)
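The grouping of the seven labels into helix, sheet and coil can be written as a lookup table. This is a minimal sketch that assumes the three-state reduction follows the slide's grouping exactly (other reductions, for example treating B as coil, are also in use); the names `DSSP_TO_3STATE` and `reduce_dssp` and the input string are hypothetical.

```python
# Assumed three-state reduction, following the grouping on the slide:
# helix (H, G) -> H, sheet (B, E) -> E, coil (T, _, S) -> C.
DSSP_TO_3STATE = {
    "H": "H", "G": "H",
    "B": "E", "E": "E",
    "T": "C", "_": "C", "S": "C",
}


def reduce_dssp(labels):
    """Collapse a string of 7-letter DSSP labels into the classes H, E, C."""
    return "".join(DSSP_TO_3STATE[label] for label in labels)


# Hypothetical DSSP annotation, invented for illustration only.
print(reduce_dssp("__EEEEETT_SS"))  # CCEEEEECCCCC
```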

Secondary Structure of a Protein Sequence

• Accuracy Limit ~ 88%

Task Definition

• Given a protein sequence:
  – APAFSVSPASGA
• Predict its secondary structure sequence:
  – CCEEEEECCCCC
  – Focus on soluble proteins, not on membrane proteins

Overview of Previous Work - 1

• 1st-generation methods
  – Calculate propensities for each amino acid
    • E.g. Chou-Fasman method (Chou & Fasman, 1974)
• 2nd-generation methods
  – “Window” concept
    • APAFSVSPAS (window size = 7)
  – Calculate propensities for segments of 3-51 amino acids
    • E.g. GOR method (Garnier et al., 1978)
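A per-residue propensity in the Chou-Fasman style can be sketched as the fraction of an amino acid found in a given state, divided by the overall fraction of residues in that state. This is a simplified illustration, not the published parameter set; the function name `propensities` and the toy data are assumptions.

```python
from collections import Counter


def propensities(sequences, labels, state):
    """Propensity of each amino acid for `state`:
    (fraction of that amino acid in `state`) / (fraction of all residues in `state`).
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, lab in zip(sequences, labels):
        for aa, s in zip(seq, lab):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_residues = sum(aa_total.values())
    n_in_state = sum(aa_in_state.values())
    if n_in_state == 0:
        return {}  # the state never occurs, so propensities are undefined
    overall = n_in_state / n_residues
    return {aa: (aa_in_state[aa] / aa_total[aa]) / overall for aa in aa_total}


# Toy data, invented for illustration: A is always helical, G never is.
print(propensities(["AAGG"], ["HHCC"], "H"))  # {'A': 2.0, 'G': 0.0}
```

A value above 1 means the amino acid is over-represented in the state; a value below 1 means it is under-represented.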

Overview of Previous Work - 2

• 3rd-generation methods
  – Use evolutionary information from multiple sequence alignments
    • p-value cut-off = 10^-2
    • PHD: neural network, sequence features only (Rost & Sander, 1993)
    • DSC: LDA with biological features: GOR, hydrophobicity, etc. (King & Sternberg, 1996)
  – Later refinements
    • Apply divergent sequence alignment: e.g. PROF (Ouali & King, 2000)
    • Combine results of different systems: e.g. Jpred (Cuff & Barton, 1999)
• Bayesian segmentation (Schmidler et al., 1999)

Summary of Performance

Method name    Performance (Q3)
Chou-Fasman    ~ 50%
GOR            ~ 56%
PHD            ~ 71%
DSC            ~ 70%

Disadvantages of Previous Work

• Most are “black box” predictors
  – Weak biological meaning
• Little focus on long-range interactions
  – Mostly focused on local information
• Performance is asymptotically bounded


TXTpred

• Basic idea:
  – Build a meaningful biological vocabulary
  – Apply language techniques for prediction
• Major challenge:
  – How to build the vocabulary?
• Context-free N-grams of amino acids inside the window
  – Sq: APAFSVSPAS (window = 7)
  – N-grams: P, A, ..., P, PA, AF, ..., SP, PAF, AFS, ..., VSP
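The context-free vocabulary can be sketched as all 1-, 2- and 3-grams inside the window. This sketch assumes the window is centered on the fifth residue (index 4, the first S), which reproduces the N-gram list on the slide; the function name `window_ngrams` and its parameters are hypothetical.

```python
def window_ngrams(seq, center, window=7, max_n=3):
    """Return all n-grams (n = 1..max_n) inside the window centered at `center`."""
    half = window // 2
    start = max(0, center - half)
    end = min(len(seq), center + half + 1)
    segment = seq[start:end]
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(segment[i:i + n] for i in range(len(segment) - n + 1))
    return grams


grams = window_ngrams("APAFSVSPAS", center=4)
print(grams[:7])    # ['P', 'A', 'F', 'S', 'V', 'S', 'P']
print(grams[7:13])  # ['PA', 'AF', 'FS', 'SV', 'VS', 'SP']
print(grams[13:])   # ['PAF', 'AFS', 'FSV', 'SVS', 'VSP']
```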

Biological Vocabulary

• Context-sensitive vocabulary
  – Analogy
    • The same word might have different meanings: e.g. “bank”
    • The same amino acid might have different properties: APAFSVSPAS
  – Encode context semantics into the N-gram
    • Record the position information in the N-gram
    • Example: APAFSVSPAS (window size = 7)
      – Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
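Position-tagged words can be generated by attaching each residue's signed offset from the window center, so that offsets run from -3 to +3 for a window of 7. A minimal sketch, assuming the center residue is tagged `+0` and the window is centered on index 4; the function name `positional_words` is hypothetical.

```python
def positional_words(seq, center, window=7):
    """Tag each residue in the window with its signed offset from the center."""
    half = window // 2
    words = []
    for offset in range(-half, half + 1):
        i = center + offset
        if 0 <= i < len(seq):  # skip positions that fall off either end
            words.append(f"{seq[i]}{offset:+d}")
    return words


print(positional_words("APAFSVSPAS", center=4))
# ['P-3', 'A-2', 'F-1', 'S+0', 'V+1', 'S+2', 'P+3']
```

With this encoding the two S residues in the window become distinct words (`S+0` and `S+2`), which is the point of the context-sensitive vocabulary.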

Text Classification

• Text classification
  – Analogy
    • The topic of a document is expressed by the words of the document
    • The structure of one residue can be inferred from the biological words nearby
  – High accuracy
  – Text classification technique
    • Doc to vectors:
      $$ tw(\text{word}) = [1 + \log(\text{word frequency})] \cdot \log\left(\frac{N}{\text{document frequency}}\right) $$
    • Classifiers: Support Vector Machines
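The term weight is a log-scaled TF-IDF: one plus the log of the word's frequency in the document, multiplied by the log of N over the word's document frequency. The sketch below assumes natural logarithms, that N is the total number of documents, and that a word with zero frequency gets weight 0; none of these details are stated on the slide, and `term_weight` is a hypothetical name.

```python
import math


def term_weight(word_frequency, document_frequency, num_documents):
    """Log-scaled TF-IDF: [1 + log(tf)] * log(N / df). Natural log is an assumption."""
    if word_frequency == 0 or document_frequency == 0:
        return 0.0  # assumed convention for absent words
    tf_part = 1.0 + math.log(word_frequency)
    idf_part = math.log(num_documents / document_frequency)
    return tf_part * idf_part


# A word seen once in this document and in 10 of 100 documents overall.
print(round(term_weight(1, 10, 100), 4))  # 2.3026
# A word that occurs in every document carries no weight.
print(term_weight(3, 100, 100))           # 0.0
```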

TXTpred Method

Settings:
• Window = 17
• One-gram, two-gram
• Feature Num = 3000

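Evaluation Measure

• Q3 (accuracy)
  $$ Q_3 = \frac{P_\alpha + P_\beta + P_{coil}}{T}, \qquad Q = \frac{p + n}{p + n + o + u} $$
• Precision, Recall
  $$ Q^{pre} = \frac{p}{p + o}, \qquad Q^{rec} = \frac{p}{p + u} $$
• Segment Overlap quantity (SOV)
  $$ SOV = \frac{1}{N} \sum_{S(i)} \frac{MINOV(S_1; S_2) + DELTA(S_1; S_2)}{MAXOV(S_1; S_2)} \times LEN(S_1) $$
• Matthews correlation coefficient
  $$ C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}} $$

Here P_x is the number of residues correctly predicted in state x and T is the total number of residues; p_i, n_i, u_i and o_i count the residues that are correctly predicted in state i, correctly rejected, underpredicted and overpredicted.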
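The residue-level measures (Q3, precision, recall and the Matthews correlation coefficient) can be computed directly from a true and a predicted label string. This is a minimal sketch that uses p, n, u, o for true positives, true negatives, underpredictions and overpredictions of one state; the function names and the example prediction are hypothetical, and SOV is left out because it needs segment-level bookkeeping.

```python
import math


def per_state_counts(true, pred, state):
    """Return (p, n, u, o) for one state: hits, correct rejections, misses, false alarms."""
    pairs = list(zip(true, pred))
    p = sum(t == state and q == state for t, q in pairs)
    n = sum(t != state and q != state for t, q in pairs)
    u = sum(t == state and q != state for t, q in pairs)
    o = sum(t != state and q == state for t, q in pairs)
    return p, n, u, o


def q3(true, pred):
    """Fraction of residues whose three-state label is predicted correctly."""
    return sum(t == q for t, q in zip(true, pred)) / len(true)


def precision(p, o):
    return p / (p + o) if p + o else 0.0


def recall(p, u):
    return p / (p + u) if p + u else 0.0


def matthews(p, n, u, o):
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0


true = "CCEEEEECCCCC"
pred = "CCEEEECCCCCC"  # invented prediction: one strand residue missed
p, n, u, o = per_state_counts(true, pred, "E")
print((p, n, u, o))                   # (4, 7, 1, 0)
print(round(q3(true, pred), 4))       # 0.9167
print(precision(p, o), recall(p, u))  # 1.0 0.8
print(round(matthews(p, n, u, o), 4)) # 0.8367
```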

Experimental Results

• RS126 dataset
• CB513 dataset

Biological Language Properties

• Power law?
  [Figure: Term Frequency = f(Rank); panels: one-gram, two-gram]
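The rank-frequency data behind such a plot can be sketched by counting N-grams and sorting them by frequency. A power law would show up as a roughly straight line when rank and frequency are both plotted on log axes. The function name `rank_frequency` and the toy sequences are assumptions, not the data used in the talk.

```python
from collections import Counter


def rank_frequency(sequences, n=1):
    """Return (rank, n-gram, frequency) triples, most frequent first."""
    counts = Counter(
        seq[i:i + n] for seq in sequences for i in range(len(seq) - n + 1)
    )
    return [
        (rank, gram, freq)
        for rank, (gram, freq) in enumerate(counts.most_common(), start=1)
    ]


# Toy sequences, invented for illustration.
print(rank_frequency(["AAAB", "AB"], n=1))  # [(1, 'A', 4), (2, 'B', 2)]
```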

Sequence Analysis – 1: Feature Selection

• Top ten discriminating features for helix
• Verification by Chou-Fasman parameters
  – Helix favors A, E, M, L, K (top 5 amino acids)
  – Disfavors P (top 1 amino acid)
• Top ten discriminating features for sheet
• Verification by Chou-Fasman parameters
  – Sheet favors V, I, Y, F, W (top 5 amino acids)
  – Disfavors D, E (top 2 amino acids)
• Top ten discriminating features for coil
• Verification by Chou-Fasman parameters
  – Coil favors N, P, G, D, S (top 5 amino acids)
  – Disfavors V, I, L (top 3 amino acids)

Sequence Analysis – 2: Word Correlation

• Word correlation
  – Some words have strong correlation and co-occur frequently
  – Technique: Singular Value Decomposition
• Examples from texts
  – Phrases: {president, Bush}
  – Semantically correlated: {Olympic, sports}
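To illustrate how a decomposition of the word-by-document matrix exposes co-occurring words, the sketch below finds only the leading left singular vector by power iteration, in pure Python. It is not a full singular value decomposition and is not claimed to be the procedure used in the talk; the matrix, the word labels and the function name `top_singular_vector` are hypothetical.

```python
import math


def top_singular_vector(matrix, iterations=100):
    """Leading left singular vector of `matrix` via power iteration on A A^T.

    Rows are words, columns are documents. Starts from an all-ones vector,
    which is assumed not to be orthogonal to the leading vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * rows
    for _ in range(iterations):
        # w = A^T v (project onto documents), then v = A w (back onto words)
        w = [sum(matrix[r][c] * v[r] for r in range(rows)) for c in range(cols)]
        v = [sum(matrix[r][c] * w[c] for c in range(cols)) for r in range(rows)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return v
        v = [x / norm for x in v]
    return v


# Hypothetical word-by-document counts: the first two words always co-occur.
words = ["E+0", "E+1", "P+0"]
matrix = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
loadings = top_singular_vector(matrix)
print([round(x, 3) for x in loadings])  # [0.707, 0.707, 0.0]
```

Words with large loadings of the same sign on the same singular vector (here `E+0` and `E+1`) are the ones that co-occur.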

Sequence Analysis – 2: Word Correlation

• Top ten correlated word pairs

Sequence Analysis – 2: Word Correlation

Regular expression   Protein sequence   Secondary structure   Conjecture
CPXXAI               Sq1: ECPNEAIM      L1: HCCCCCEC          Coil connected to sheet
                     Sq2: ECPAEAIK      L2: HCCCCCEE
                     Sq3: GCPIPAIL      L3: CCCCCEEE
PGH                  Sq1: TFPGHSA       L1: CCCCCCC
                     Sq2: DCPGHAD       L2: ECCCHHH
EEL                  Sq1: DDEELLE       L1: CCHHHHH
                     Sq2: WSEELNS       L2: CCHHHHH
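The patterns in the table can be checked with Python's `re` module. This sketch assumes that X in a pattern means "any residue" and reads off the secondary-structure labels under each match; the helper names `motif_to_regex` and `labels_under_match` are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Compile a motif in which X is assumed to stand for any single residue."""
    return re.compile(motif.replace("X", "."))


def labels_under_match(motif, sequence, labels):
    """Return the structure labels aligned with the first match, or None."""
    match = motif_to_regex(motif).search(sequence)
    if match is None:
        return None
    start, end = match.span()
    return labels[start:end]


print(labels_under_match("CPXXAI", "ECPNEAIM", "HCCCCCEC"))  # CCCCCE
print(labels_under_match("CPXXAI", "GCPIPAIL", "CCCCCEEE"))  # CCCCEE
print(labels_under_match("EEL", "DDEELLE", "CCHHHHH"))       # HHH
```

In both CPXXAI examples the matched labels run from coil into strand, which is the "coil connected to sheet" conjecture in the table.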

Conclusion

• TXTpred summary
  – Context-sensitive biological vocabulary
  – Novel application of text classification to secondary structure prediction
  – Comparable performance for secondary structure prediction
  – Analysis provides reasonable biological meanings and structure indicators

Future Work

• Deeper study on extracting a more meaningful biological vocabulary
• Further discovery of new features, such as torsion angle and free energy
• Advanced learning models to consider long-range interactions
  – Conditional random fields, maximum entropy Markov models

Acknowledgement

• Vanathi Gopalakrishnan, UPitt

• Ivet Bahar, UPitt

Motivation for 2-D prediction

• Basis for three-dimensional structure prediction
• Improving other sequence and structure analysis
  – Sequence alignment
  – Threading and homology modeling
  – Experimental data
  – Protein design
