Protein Tertiary and Quaternary Fold Recognition: A ML Approach. Jaime Carbonell. Joint work with: Yan Liu (IBM), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT). Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Machine Learning Lunch – 11-April-2007

Page 1: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Carnegie Mellon School of Computer Science

Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Jaime Carbonell

Joint work with: Yan Liu (IBM), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)

Language Technologies Institute, Carnegie Mellon University

Machine Learning Lunch – 11-April-2007

Page 2: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Snapshot of Cell Biology (image: Nobelprize.org)

[Figure labels: protein sequence (DSCTFTTAAAAKAGKAKAG), protein structure, protein function]

Page 3: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

PROTEINS: Sequence → Structure → Function

Primary Sequence

MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA

[Figure: primary sequence → folding → 3D structure → complex function within network of proteins (Normal)]

(Borrowed from: Judith Klein-Seetharaman)

Page 4: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

PROTEINS: Sequence → Structure → Function

Primary Sequence: same sequence as on Page 3

[Figure: primary sequence → folding → 3D structure → complex function within network of proteins (Disease)]

Page 5: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Example Protein Structures

[Figure panels: Adenovirus Fiber Shaft; Virus Capsid]

Triple beta-spiral fold in Adenovirus Fiber Shaft

Page 6: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Predicting Protein Structures

• Protein structure is a key determinant of protein function
• Resolving protein structures experimentally (in vitro) by crystallography is very expensive, and NMR can only resolve very small proteins
• The gap between known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
• Therefore we need to predict structures in silico
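As a quick check of the gap quoted on this slide, the percentage can be recomputed from the two counts. The variable names are illustrative only.

```python
# Counts quoted on the slide (as of the 2007 talk).
known_sequences = 3_023_461
resolved_structures = 36_247

# Share of known sequences that have an experimentally resolved structure.
coverage = resolved_structures / known_sequences
coverage_percent = round(100 * coverage, 1)
print(f"{coverage_percent}% of known sequences have a resolved structure")
```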

Page 7: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Quaternary Folds and Alignments

• Protein fold: identifiable regular arrangement of secondary structural elements
• Thus far, a limited number of protein folds have been discovered (~1000)
• Very little research has addressed quaternary folds
  • Complex structures and few labeled data
• Quaternary fold recognition

Seq 1: APA FSVSPA … SGACGP ECAESG
Seq 2: DSCTFT…TAAAAKAGKAKCSTITL

Biology task → AI task
  Protein fold → Pattern to be induced
  Membership and non-membership proteins → Training data (seq-struc pairs + physics)
  Will the protein take the fold? → Does the pattern appear in the testing sequence?

Page 8: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Previous Work

• Sequence similarity perspective
  • Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
  • Profile HMM, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
  • Window-based methods, e.g. PSIPRED [Jones, 2001]
• Physical forces perspective
  • Homology modeling or threading, e.g. Threader [Jones, 1998]
• Structural biology perspective
  • Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001]

Limitations noted on the slide:

• Generative models based on rough approximations of free energy perform very poorly on complex structures
• Very hard to generalize due to built-in constants and fixed features
• Fail to capture structural properties and long-range dependencies

Page 9: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Conditional Random Fields

• Hidden Markov model (HMM) [Rabiner, 1989]

$$P(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} P(x_i \mid y_i)\, P(y_i \mid y_{i-1})$$

• Conditional random fields (CRFs) [Lafferty et al, 2001]

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_0} \exp\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, i, y_{i-1}, y_i) \Big)$$

$$f_k(\mathbf{x}, i, y_{i-1}, y_i) = f'_k(\mathbf{x}, i)\, I(y_{i-1} = s,\ y_i = s')$$

  • Model conditional probability directly (discriminative models, directly optimizable)
  • Allow arbitrary dependencies in observation
  • Adaptive to different loss functions and regularizers
  • Promising results in multiple applications
  • But need to scale up (computationally) and extend to long-distance dependencies
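The linear-chain CRF on this slide can be illustrated with a minimal sketch, not the authors' implementation. The label set, the weight tables `EMIT` and `TRANS`, and the toy sequence are all hypothetical. The sketch computes the normalizer with the forward recursion and checks that the conditional probabilities over all label sequences sum to one.

```python
import itertools
import math

STATES = ["H", "E", "C"]  # hypothetical label set (helix, strand, coil)

# Hypothetical weights lambda_k, keyed by the indicator feature they multiply.
EMIT = {("A", "H"): 1.0, ("V", "E"): 1.2, ("G", "C"): 0.8}   # observation at i, label y_i
TRANS = {("H", "H"): 0.5, ("E", "E"): 0.5, ("C", "C"): 0.2}  # label pair (y_{i-1}, y_i)

def local_score(x, i, y_prev, y_cur):
    """Sum_k lambda_k * f_k(x, i, y_{i-1}, y_i); y_prev is None at i = 0."""
    score = EMIT.get((x[i], y_cur), 0.0)
    if y_prev is not None:
        score += TRANS.get((y_prev, y_cur), 0.0)
    return score

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(x):
    """Log of the normalizer via the forward recursion, O(N * |S|^2)."""
    alpha = {s: local_score(x, 0, None, s) for s in STATES}
    for i in range(1, len(x)):
        alpha = {
            s: logsumexp([alpha[r] + local_score(x, i, r, s) for r in STATES])
            for s in STATES
        }
    return logsumexp(list(alpha.values()))

def log_prob(x, y):
    """log P(y | x): unnormalized path score minus the log normalizer."""
    score = sum(
        local_score(x, i, y[i - 1] if i else None, y[i]) for i in range(len(x))
    )
    return score - log_partition(x)

x = "AVGA"
total = sum(
    math.exp(log_prob(x, y)) for y in itertools.product(STATES, repeat=len(x))
)
print(round(total, 6))
```

Because the model is normalized over whole label sequences given x, the features may look at any part of the observation, which is the "arbitrary dependencies in observation" point made on the slide.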

Page 10: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Our Solution: Conditional Graphical Models

• Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
• Feature definition
  • Node feature

$$f_k(w_i, \mathbf{x}) = f'_k(\mathbf{x}, p_i, q_i)\, I(s_i = s',\ q_i - p_i + 1 = d')$$

  • Local interaction feature

$$f_k(w_{i-1}, w_i, \mathbf{x}) = I(s_{i-1} = s,\ s_i = s',\ p_i = q_{i-1} + 1)$$

  • Long-range interaction feature

$$f_k(w_i, w_j, \mathbf{x}) = g'_k(\mathbf{x}, p_i, q_i, p_j, q_j)\, I(s_i = s,\ s_j = s')$$

[Figure labels: Local dependency; Long-range dependency]
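The three feature types on this slide can be illustrated with a small sketch. It assumes that W_i = {p_i, q_i, s_i} denotes a segment's start position, end position and state, and that M is the number of segments; the slide does not spell this out. The segment-length convention `q - p + 1`, the toy sequence, and the `hydrophobic_match` function are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    p: int   # assumed: start position (inclusive)
    q: int   # assumed: end position (inclusive)
    s: str   # assumed: state label of the segment

def indicator(cond):
    return 1.0 if cond else 0.0

def node_feature(w, x, state, length, base=lambda x, p, q: 1.0):
    """base(x, p, q) times an indicator on the segment's state and length."""
    return base(x, w.p, w.q) * indicator(w.s == state and w.q - w.p + 1 == length)

def local_feature(w_prev, w, state_prev, state):
    """Indicator on the states of two segments that are adjacent in sequence."""
    return indicator(
        w_prev.s == state_prev and w.s == state and w.p == w_prev.q + 1
    )

def long_range_feature(w_i, w_j, x, state_i, state_j, base):
    """base(x, p_i, q_i, p_j, q_j) times an indicator on both segment states."""
    return base(x, w_i.p, w_i.q, w_j.p, w_j.q) * indicator(
        w_i.s == state_i and w_j.s == state_j
    )

def hydrophobic_match(x, p_i, q_i, p_j, q_j):
    """Hypothetical pairwise score: share of aligned positions that are both hydrophobic."""
    hydrophobic = set("AVILMFWY")
    pairs = list(zip(x[p_i:q_i + 1], x[p_j:q_j + 1]))
    return sum(a in hydrophobic and b in hydrophobic for a, b in pairs) / len(pairs)

x = "GVIVAGGSLIVAG"
segments = [
    Segment(0, 0, "T"), Segment(1, 4, "B"), Segment(5, 7, "T"),
    Segment(8, 11, "B"), Segment(12, 12, "T"),
]
M = len(segments)

node_value = node_feature(segments[1], x, state="B", length=4)
local_value = local_feature(segments[0], segments[1], "T", "B")
pair_value = long_range_feature(segments[1], segments[3], x, "B", "B", hydrophobic_match)
print(M, node_value, local_value, pair_value)
```

Working on segments rather than single residues is what lets the long-range feature compare two whole structural elements that are far apart in the sequence.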

Page 11: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Linked Segmentation CRF

• Node: secondary structure elements and/or simple fold
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: conditional probability of y given x is defined as

$$P(\mathbf{y}_1, \ldots, \mathbf{y}_R \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big( \sum_{k} \lambda_k f_k(\mathbf{x}_i, y_{i,j}) \Big) \prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big( \sum_{l} \lambda_l g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b}) \Big)$$

[Figure label: Joint Labels]

Page 12: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

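Linked Segmentation CRF (II)

• Classification:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, Y_c)$$

• Training: learn the model parameters λ by minimizing the regularized negative log loss, i.e. maximizing

$$L = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, \mathbf{y}_c) - \log Z - \Omega(\lambda)$$

  • Iterative search algorithms seek the direction in which the empirical feature values agree with their expectations:

$$\frac{\partial L}{\partial \lambda_k} = \sum_{c \in C_G} \Big( f_k(\mathbf{x}, \mathbf{y}_c) - E_{p(\mathbf{y} \mid \mathbf{x})}\big[ f_k(\mathbf{x}, Y_c) \big] \Big) - \frac{\partial \Omega(\lambda)}{\partial \lambda_k} = 0$$

• Complex graphs result in huge computational complexity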
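The training idea on this slide, moving the parameters until empirical feature values agree with their expectations, can be sketched on a toy model. Everything here is an assumption made for illustration: the two features, the binary data, plain gradient ascent as the iterative search, and a Gaussian prior with variance `sigma2` as the regularizer. Expectations are computed by brute-force enumeration, which is exactly the step that becomes infeasible on complex graphs.

```python
import itertools
import math

LABELS = [0, 1]

def features(x, y):
    """Hypothetical feature vector, summed over node and edge cliques."""
    agree = sum(1.0 for xi, yi in zip(x, y) if xi == yi)     # node cliques
    smooth = sum(1.0 for a, b in zip(y, y[1:]) if a == b)    # edge cliques
    return [agree, smooth]

def expected_features(x, lam):
    """Expected features under p(y | x), by enumerating every labeling."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    scores = [sum(l * f for l, f in zip(lam, features(x, y))) for y in ys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    expected = [0.0] * len(lam)
    for w, y in zip(weights, ys):
        for k, f in enumerate(features(x, y)):
            expected[k] += w / z * f
    return expected

def gradient(data, lam, sigma2=10.0):
    """Empirical minus expected features, minus the assumed Gaussian-prior term."""
    grad = [-l / sigma2 for l in lam]
    for x, y in data:
        empirical = features(x, y)
        expected = expected_features(x, lam)
        for k in range(len(lam)):
            grad[k] += empirical[k] - expected[k]
    return grad

# Two hypothetical (observation, labeling) training pairs.
data = [((0, 0, 1, 1), (0, 0, 1, 1)), ((1, 1, 1, 0), (1, 1, 1, 1))]

lam = [0.0, 0.0]
for _ in range(2000):                      # plain gradient ascent, fixed step size
    g = gradient(data, lam)
    lam = [l + 0.1 * gk for l, gk in zip(lam, g)]

final_grad = gradient(data, lam)
print([round(l, 3) for l in lam], [round(g, 6) for g in final_grad])
```

At convergence the gradient is close to zero, so the gap between empirical and expected feature values equals the pull of the regularizer.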

Page 13: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Approximate Inference of L-SCRF

• Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so…
• Reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001] with four types of Metropolis operators
  • State switching
  • Position switching
  • Segment split
  • Segment merge
• Simulated annealing reversible jump MCMC [Andrieu et al, 2000]
  • Replace the sample with RJ MCMC
  • Theoretically converges to the global optimum
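The four operators named on this slide can be sketched as proposal moves on a segmentation, combined with a simple annealed acceptance rule. This is a simplified illustration, not a correct reversible jump sampler: it omits the proposal-ratio terms that reversible jump MCMC requires for dimension-changing moves. The state set, the linear cooling schedule and `toy_score` are hypothetical.

```python
import math
import random

STATES = ["B", "T"]  # hypothetical segment states

def state_switch(segs, rng):
    i = rng.randrange(len(segs))
    p, q, s = segs[i]
    new_s = rng.choice([t for t in STATES if t != s])
    return segs[:i] + [(p, q, new_s)] + segs[i + 1:]

def position_switch(segs, rng):
    """Move the boundary between two adjacent segments by one residue."""
    if len(segs) < 2:
        return segs
    i = rng.randrange(len(segs) - 1)
    (p1, q1, s1), (p2, q2, s2) = segs[i], segs[i + 1]
    new_q1 = q1 + rng.choice([-1, 1])
    if new_q1 < p1 or new_q1 + 1 > q2:     # the move would empty a segment
        return segs
    return segs[:i] + [(p1, new_q1, s1), (new_q1 + 1, q2, s2)] + segs[i + 2:]

def segment_split(segs, rng):
    i = rng.randrange(len(segs))
    p, q, s = segs[i]
    if p == q:
        return segs
    cut = rng.randrange(p, q)              # left part p..cut, right part cut+1..q
    return segs[:i] + [(p, cut, s), (cut + 1, q, rng.choice(STATES))] + segs[i + 1:]

def segment_merge(segs, rng):
    if len(segs) < 2:
        return segs
    i = rng.randrange(len(segs) - 1)
    (p1, _, s1), (_, q2, _) = segs[i], segs[i + 1]
    return segs[:i] + [(p1, q2, s1)] + segs[i + 2:]

MOVES = [state_switch, position_switch, segment_split, segment_merge]

def toy_score(x, segs):
    """Hypothetical score: 'B' over hydrophobic residues, 'T' elsewhere, fewer segments."""
    hydrophobic = set("AVILMFWY")
    total = -0.5 * len(segs)
    for p, q, s in segs:
        for c in x[p:q + 1]:
            total += 1.0 if (c in hydrophobic) == (s == "B") else -1.0
    return total

def anneal(x, score, steps=2000, t0=2.0, seed=0):
    rng = random.Random(seed)
    segs = [(0, len(x) - 1, STATES[0])]
    best = segs
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-3   # linear cooling (an assumption)
        cand = rng.choice(MOVES)(segs, rng)
        delta = score(x, cand) - score(x, segs)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            segs = cand
            if score(x, segs) > score(x, best):
                best = segs
    return best

def is_valid(segs, n):
    """Segments must tile positions 0..n-1 with no gaps or overlaps."""
    return (segs[0][0] == 0 and segs[-1][1] == n - 1
            and all(a[1] + 1 == b[0] for a, b in zip(segs, segs[1:]))
            and all(p <= q for p, q, _ in segs))

x = "GVIVAGGSLIVAG"
best = anneal(x, toy_score)
print(best, toy_score(x, best))
```

Split and merge change the number of segments, which is why a sampler over a fixed set of variables is not enough for this model.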

Page 14: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Features for Protein Fold Recognition

Page 15: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Tertiary Fold Recognition: β-Helix fold

• Histogram and ranks for known β-helices against PDB-minus dataset
• Chain graph model reduces the actual running time of the SCRF model by a factor of about 50

Page 16: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Fold Alignment Prediction: β-Helix

• Predicted alignment for known β-helices on cross-family validation

Page 17: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Discovery of New Potential β-helices

• Run structural predictor seeking potential β-helices from UniProt (structurally unresolved) databases
  • Full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
• Verification on 3 proteins with later experimentally resolved structures from different organisms
  • 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  • 1PXZ: The Major Allergen From Cedar Pollen
  • GP14 of Shigella bacteriophage as a β-helix protein
• Not a single false positive!

Page 18: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Experiments: Target Quaternary Fold

• Triple beta-spirals [van Raaij et al. Nature 1999]

Virus fibers in adenovirus, reovirus and PRD1

• Double barrel trimer [Benson et al, 2004]

Coat protein of adenovirus, PRD1, STIV, PBCV

Page 19: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Experiment Results: Fold Recognition

[Figure panels: Double barrel-trimer; Triple beta-spirals]

Page 20: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Experiment Results: Alignment Prediction

Triple beta-spirals

• Four states: B1, B2, T1 and T2
• Correct alignment: B1: i – o; B2: a – h
• Predicted alignment: [Figure: predicted B1 and B2 positions]

Page 21: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Experiment Results: Discovery of New Membership Proteins

• Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
• Membership proteins of double barrel-trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions

Page 22: Protein Tertiary and Quaternary Fold Recognition: A ML Approach

Concluding Remarks

• Conditional graphical models for protein structure prediction
  • Effective representation for protein structural properties
  • Feasibility to incorporate different kinds of informative features
  • Efficient inference algorithms for large-scale applications
• A major extension compared with previous work
  • Knowledge representation through graphical models
  • Ability to handle long-range interactions within one chain and between chains
• Future work
  • Automatic learning of graph topology
  • Active learning – including minority-class discovery