Carnegie Mellon School of Computer Science, Biological Language Modeling Project
Copyright © 2003, Carnegie Mellon. All Rights Reserved.

TXTpred: A New Method for Protein Secondary Structure Prediction
Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science, Carnegie Mellon University
May 14, 2003

Page 1: TXTpred: A New Method for Protein Secondary Structure Prediction

TXTpred: A New Method for Protein Secondary Structure Prediction

Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science
Carnegie Mellon University

May 14, 2003

Page 2: TXTpred: A New Method for Protein Secondary Structure Prediction

Roadmap

• Overview of secondary structure prediction
• Description of TXTpred method
• Experiment results and analysis
• Discussion and further work

Page 3: TXTpred: A New Method for Protein Secondary Structure Prediction

Secondary Structure of a Protein Sequence

• The Dictionary of Protein Secondary Structure (DSSP) annotates each residue with its structure
  – based on hydrogen bonding patterns and geometrical constraints
• 7 DSSP labels for protein secondary structure (PSS):
  – Helix types: H (alpha-helix), G (3/10 helix)
  – Sheet types: B (isolated beta-bridge), E (strand)
  – Coil types: T, _, S (coil)
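The grouping of the seven DSSP labels into helix, sheet, and coil can be sketched as a lookup table. This is only an illustration: the output letters H, E, and C are the conventional three-state alphabet (they match the CCEEEEECCCCC example used in the task definition), and the function name `reduce_dssp` is hypothetical.

```python
# Reduce the seven DSSP labels to three classes, following the grouping
# on the slide. The output alphabet H/E/C is an assumption based on the
# usual three-state convention.
DSSP_TO_3STATE = {
    "H": "H", "G": "H",            # helix types
    "B": "E", "E": "E",            # sheet types
    "T": "C", "_": "C", "S": "C",  # coil types
}


def reduce_dssp(labels: str) -> str:
    """Map a string of DSSP labels to a three-state H/E/C string."""
    return "".join(DSSP_TO_3STATE[label] for label in labels)


if __name__ == "__main__":
    print(reduce_dssp("__EEEEBTTS_GGGHHHH"))
```

A label outside the seven listed here raises `KeyError`, which makes unexpected annotations visible instead of silently mapping them to coil.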

Page 4: TXTpred: A New Method for Protein Secondary Structure Prediction

Secondary Structure of a Protein Sequence

• Accuracy Limit ~ 88%

Page 5: TXTpred: A New Method for Protein Secondary Structure Prediction

Task Definition

• Given a protein sequence:
  – APAFSVSPASGA
• Predict its secondary structure sequence:
  – CCEEEEECCCCC
  – Focus on soluble proteins, not on membrane proteins

Page 6: TXTpred: A New Method for Protein Secondary Structure Prediction

Overview of Previous Work – 1

• 1st-generation methods
  – Calculate propensities for each amino acid
    • E.g. Chou-Fasman method (Chou & Fasman, 1974)
• 2nd-generation methods
  – “Window” concept
    • APAFSVSPAS (window size = 7)
  – Calculate propensities for segments of 3-51 amino acids
    • E.g. GOR method (Garnier et al., 1978)
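As a rough illustration of the first-generation idea, a per-residue propensity can be estimated from labeled sequences as the fraction of a residue's occurrences found in a given state, divided by the overall fraction of residues in that state. The sketch below assumes that definition; the function name `propensities` and the toy data are hypothetical, and this is not the published Chou-Fasman parameter set.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Return {amino acid: propensity for `state`} from labeled sequences.

    Propensity = (fraction of the residue's occurrences labeled `state`)
                 / (fraction of all residues labeled `state`).
    Values above 1 suggest the residue favors the state.
    """
    total = Counter()     # occurrences of each amino acid
    in_state = Counter()  # occurrences of each amino acid labeled `state`
    for seq, ss in zip(sequences, structures):
        for aa, label in zip(seq, ss):
            total[aa] += 1
            if label == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        return {}
    background = n_state / n_all
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    seqs = ["AAEP", "AEPP"]
    labels = ["HHHC", "HHCC"]
    print(propensities(seqs, labels, "H"))
```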

Page 7: TXTpred: A New Method for Protein Secondary Structure Prediction

Overview of Previous Work – 2

• 3rd-generation methods
  – Use evolutionary information from multiple sequence alignments
    • p-value cut-off = 10^-2
    • PHD: neural network, sequence features only (Rost & Sander, 1993)
    • DSC: LDA with biological features: GOR, hydrophobicity, etc. (King & Sternberg, 1996)
  – Later refinements
    • Apply divergent sequence alignment: e.g. PROF (Ouali & King, 2000)
    • Combine results of different systems: e.g. Jpred (Cuff & Barton, 1999)
• Bayesian segmentation (Schmidler et al., 1999)

Page 8: TXTpred: A New Method for Protein Secondary Structure Prediction

Summary of Performance

Method Name     Performance (Q3)
Chou-Fasman     ~ 50%
GOR             ~ 56%
PHD             ~ 71%
DSC             ~ 70%

Page 9: TXTpred: A New Method for Protein Secondary Structure Prediction

Disadvantages of Previous Work

• Most are “black box” predictors
  – Weak biological meanings
• Little focus on long-range interaction
  – Mostly focused on local information
• Performance is asymptotically bounded

Page 10: TXTpred: A New Method for Protein Secondary Structure Prediction

Roadmap

• Overview of secondary structure prediction
• Description of TXTpred method
• Experiment results and analysis
• Discussion and further work

Page 11: TXTpred: A New Method for Protein Secondary Structure Prediction

TXTpred

• Basic idea:
  – Build meaningful biological vocabulary
  – Apply language techniques for prediction
• Major challenge:
  – How to build the vocabulary?
• Context-free N-grams of amino acids inside the window
  – Sq: APAFSVSPAS (window = 7)
  – N-gram: P, A, .., P, PA, AF, .., SP, PAF, AFS, .., VSP
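The context-free vocabulary above can be sketched as follows. The assumptions are that the window is centered on the residue being predicted (here the first S of APAFSVSPAS, zero-based index 4) and that N runs from 1 to 3, which reproduces the N-grams listed on the slide; the function name `window_ngrams` is hypothetical.

```python
def window_ngrams(sequence, center, window=7, max_n=3):
    """Return all 1..max_n-grams inside the window centered on `center`.

    The window is truncated at the ends of the sequence.
    """
    half = window // 2
    start = max(0, center - half)
    end = min(len(sequence), center + half + 1)
    segment = sequence[start:end]
    grams = []
    for n in range(1, max_n + 1):
        # Slide an n-wide frame across the window segment.
        for i in range(len(segment) - n + 1):
            grams.append(segment[i:i + n])
    return grams


if __name__ == "__main__":
    # Window segment is PAFSVSP, giving P, A, ..., PA, AF, ..., PAF, ..., VSP.
    print(window_ngrams("APAFSVSPAS", center=4))
```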

Page 12: TXTpred: A New Method for Protein Secondary Structure Prediction

Biological Vocabulary

• Context-sensitive vocabulary
  – Analogy
    • Same word might have different meanings: e.g. “bank”
    • Same amino acid might have different properties: APAFSVSPAS
  – Encode context semantics into the N-gram
    • Record the position information in the N-gram
    • Example: APAFSVSPAS (window size = 7)
      – Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
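A minimal sketch of the position-tagged vocabulary, assuming each word is the residue letter followed by its signed offset from the window center (offsets run from -3 to +3 for a window of 7). The function name `positional_words` is hypothetical, and only one-grams are shown.

```python
def positional_words(sequence, center, window=7):
    """Return position-tagged one-gram words for the window around `center`.

    Each word is '<residue><signed offset>', e.g. 'P-3' or 'S+0'.
    Positions that fall outside the sequence are skipped.
    """
    half = window // 2
    words = []
    for offset in range(-half, half + 1):
        pos = center + offset
        if 0 <= pos < len(sequence):
            words.append(f"{sequence[pos]}{offset:+d}")
    return words


if __name__ == "__main__":
    print(positional_words("APAFSVSPAS", center=4))
```

With this encoding the two S residues in the window become distinct words, which is the point of the "bank" analogy: the same letter carries different information at different positions.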

Page 13: TXTpred: A New Method for Protein Secondary Structure Prediction

Text Classification

• Text classification
  – Analogy
    • The topic of a document is expressed by the words of the document
    • The structure of one residue can be inferred from the biological words nearby
  – High accuracy
  – Text classification technique
    • Doc to vectors:
      \[ tw(\text{word}) = \left[1 + \log(\text{word frequency})\right] \cdot \log\left(\frac{N}{\text{document frequency}}\right) \]
    • Classifiers: Support Vector Machines
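The term weight above can be sketched in a few lines. The assumptions here are that a "document" is the list of words in one residue's window, that N is the total number of documents, and that the logarithm is natural (the slide does not state the base); `tfidf_vectors` is a hypothetical name. The SVM step is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Weight each word as [1 + log(tf)] * log(N / df).

    documents: list of word lists. Returns one {word: weight} dict per
    document. N is taken to be the number of documents (an assumption).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            word: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[word])
            for word, tf in counts.items()
        })
    return vectors


if __name__ == "__main__":
    docs = [["A-1", "S+0", "S+0"], ["A-1", "V+1"]]
    for vec in tfidf_vectors(docs):
        print(vec)
```

A word that occurs in every document gets weight zero, so it contributes nothing to the classifier input.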

Page 14: TXTpred: A New Method for Protein Secondary Structure Prediction

TXTpred Method

Settings:
• Window = 17
• One-gram, two-gram
• Feature Num = 3000

Page 15: TXTpred: A New Method for Protein Secondary Structure Prediction

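Evaluation Measure

Per-state counts (T = true label, P = predicted label):

        P+   P-
  T+    p    u
  T-    o    n

• Q3 (accuracy)
  \[ Q_3 = \frac{p + n}{p + n + u + o} \]
• Precision, Recall
  \[ Q^{pre} = \frac{p}{p + o}, \qquad Q^{rec} = \frac{p}{p + u} \]
• Segment Overlap quantity (SOV)
  \[ SOV(S_1, S_2) = \frac{1}{N} \sum_{S(i)} \frac{MINOV(S_1; S_2) + DELTA(S_1; S_2)}{MAXOV(S_1; S_2)} \cdot LEN(S_1) \]
• Matthews correlation coefficient
  \[ C_i = \frac{p_i n_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(n_i + u_i)(n_i + o_i)}} \]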
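The count-based measures can be sketched as below, using p, n, u, and o for one state as correct positives, correct negatives, under-predictions, and over-predictions. The function names and the predicted string are hypothetical; the true string is the task-definition example. SOV is not implemented here because it needs segment-level bookkeeping.

```python
import math


def per_state_counts(true, pred, state):
    """Return (p, n, u, o) for one state from aligned label strings."""
    p = n = u = o = 0
    for t, q in zip(true, pred):
        if t == state and q == state:
            p += 1  # correctly predicted as `state`
        elif t != state and q != state:
            n += 1  # correctly rejected
        elif t == state:
            u += 1  # under-predicted (missed)
        else:
            o += 1  # over-predicted (false alarm)
    return p, n, u, o


def accuracy(p, n, u, o):
    total = p + n + u + o
    return (p + n) / total if total else 0.0


def precision(p, n, u, o):
    return p / (p + o) if p + o else 0.0


def recall(p, n, u, o):
    return p / (p + u) if p + u else 0.0


def matthews(p, n, u, o):
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0


if __name__ == "__main__":
    true = "CCEEEEECCCCC"
    pred = "CCCEEEECCCHH"  # invented prediction, for illustration only
    counts = per_state_counts(true, pred, "E")
    print(counts, accuracy(*counts), precision(*counts),
          recall(*counts), matthews(*counts))
```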

Page 16: TXTpred: A New Method for Protein Secondary Structure Prediction

Experimental Results

• RS126 datasets

• CB513 datasets

Page 17: TXTpred: A New Method for Protein Secondary Structure Prediction

Biological Language Properties

Power law?

[Figure: Term Frequency = f(Rank); panels: One-gram, Two-gram]
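One way to probe the power-law question is to rank N-grams by frequency and fit a straight line to log(frequency) against log(rank); a roughly constant negative slope would be consistent with a power law. The sketch below is an assumption about how such a check could be done, not the analysis behind the figure, and both function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(sequences, n=1):
    """Return [(rank, ngram, frequency)] sorted by decreasing frequency."""
    counts = Counter(
        seq[i:i + n] for seq in sequences for i in range(len(seq) - n + 1)
    )
    return [(rank, gram, freq)
            for rank, (gram, freq) in enumerate(counts.most_common(), start=1)]


def loglog_slope(table):
    """Least-squares slope of log(frequency) versus log(rank)."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx if sxx else float("nan")


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    table = rank_frequency(["APAFSVSPAS", "APAFSVSPASGA"], n=2)
    print(table[:5], loglog_slope(table))
```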

Page 18: TXTpred: A New Method for Protein Secondary Structure Prediction

Sequence Analysis – 1: Feature Selection

• Top ten discriminating features for Helix
• Verification by Chou-Fasman parameters
  – Helix favors A, E, M, L, K (top 5 amino acids)
  – Disfavors P (top 1 amino acid)

Page 19: TXTpred: A New Method for Protein Secondary Structure Prediction

Sequence Analysis – 1: Feature Selection

• Top ten discriminating features for Sheet
• Verification by Chou-Fasman parameters
  – Sheet favors V, I, Y, F, W (top 5 amino acids)
  – Disfavors D, E (top 2 amino acids)

Page 20: TXTpred: A New Method for Protein Secondary Structure Prediction

Sequence Analysis – 1: Feature Selection

• Top ten discriminating features for Coil
• Verification by Chou-Fasman parameters
  – Coil favors N, P, G, D, S (top 5 amino acids)
  – Disfavors V, I, L (top 3 amino acids)

Page 21: TXTpred: A New Method for Protein Secondary Structure Prediction

Sequence Analysis – 2: Word Correlation

• Word correlation
  – Some words have strong correlation and co-occur frequently
  – Technique: Singular Value Decomposition
• Examples from texts
  – Phrases: {president, Bush}
  – Semantically correlated: {Olympic, sports}

Page 22: TXTpred: A New Method for Protein Secondary Structure Prediction

Sequence Analysis – 2: Word Correlation

• Top ten correlated word pairs

Page 23: TXTpred: A New Method for Protein Secondary Structure Prediction

Sequence Analysis – 2: Word Correlation

Regular Expression   Protein Sequence   Secondary Structure   Conjecture
CPXXAI               Sq1: ECPNEAIM      L1: HCCCCCEC          Coil connected to Sheet
                     Sq2: ECPAEAIK      L2: HCCCCCEE
                     Sq3: GCPIPAIL      L3: CCCCCEEE
PGH                  Sq1: TFPGHSA       L1: CCCCCCC           Coil
                     Sq2: DCPGHAD       L2: ECCCHHH
EEL                  Sq1: DDEELLE       L1: CCHHHHH           Helix
                     Sq2: WSEELNS       L2: CCHHHHH
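The patterns in this table can be matched against sequences with ordinary regular expressions. The sketch below assumes that X stands for any single residue, so CPXXAI becomes `CP..AI`; the function names are hypothetical. It returns the secondary structure labels under each match, which is the kind of evidence the conjecture column summarizes.

```python
import re


def pattern_to_regex(pattern):
    """Compile a motif, treating 'X' as a wildcard for any one residue."""
    return re.compile(
        "".join("." if ch == "X" else re.escape(ch) for ch in pattern)
    )


def matched_labels(pattern, sequence, labels):
    """Return the structure labels aligned with each match of `pattern`."""
    regex = pattern_to_regex(pattern)
    return [labels[m.start():m.end()] for m in regex.finditer(sequence)]


if __name__ == "__main__":
    examples = [
        ("CPXXAI", "ECPNEAIM", "HCCCCCEC"),
        ("PGH", "DCPGHAD", "ECCCHHH"),
        ("EEL", "WSEELNS", "CCHHHHH"),
    ]
    for pattern, seq, lab in examples:
        print(pattern, matched_labels(pattern, seq, lab))
```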

Page 24: TXTpred: A New Method for Protein Secondary Structure Prediction

Conclusion

• TXTpred Summary
  – Context-sensitive biological vocabulary
  – Novel application of text classification to secondary structure prediction
  – Comparable performance for secondary structure prediction
  – Analysis provides reasonable biological meanings and structure indicators

Page 25: TXTpred: A New Method for Protein Secondary Structure Prediction

Future Work

• Deeper study on extracting more meaningful biological vocabulary
• Further discovery of new features, such as torsion angle and free energy
• Advanced learning models to consider long-range interactions
  – Conditional random fields, maximum entropy Markov models

Page 26: TXTpred: A New Method for Protein Secondary Structure Prediction

Acknowledgement

• Vanathi Gopalakrishnan, UPitt

• Ivet Bahar, UPitt

Page 27: TXTpred: A New Method for Protein Secondary Structure Prediction

Motivation for 2-D prediction

• Basis for three-dimensional structure prediction
• Improving other sequence and structure analysis
  – Sequence alignment
  – Threading and homology modeling
  – Experimental data
  – Protein design