Protein tertiary structure prediction with new machine learning approaches
Rui Kuang
Department of Computer Science, Columbia University
Supervisors: Jason Weston (NEC) and Christina Leslie (Columbia)
NEC summer internship talk, August 30th, 2005

Protein tertiary structure prediction with new machine learning …rkuang/paper/intern.pdf · 2005. 9. 30. · Protein tertiary structure prediction with new machine learning approaches



Agenda

1. Introduction to protein structure
2. Protein backbone angle prediction with structured output learning
3. Protein domain detection based on protein structural classification
4. Discussion

Part 1: Protein structure

• Protein: name derived from the Greek word proteios, meaning “of the first rank”, in 1838 by Jöns J. Berzelius

• Crucial in all biological processes

• Function depends on structure (structure can help us to understand function)

• Determination of protein structures is time consuming and expensive

How to describe protein structure

• Primary structure: amino acid sequence
• Secondary structure: local structure elements
• Tertiary structure: packing and arrangement of secondary structure, also called domain
• Quaternary structure: arrangement of several polypeptide chains

Describe protein tertiary structure by protein backbone angles

[Figure: the 3-D structure (too complicated to predict!) is simplified to a sequence of Phi-Psi angles: …, (Φ1,Ψ1), (Φ2,Ψ2), (Φ3,Ψ3), …, (Φ8,Ψ8), …]

Oliver et al. (Journal of Molecular Biology, 1997)

Discretization of Phi-Psi angles: conformational states

De Brevern et al. (Protein Science, 2002)

Protein blocks

16 small prototypes (a-p) of local protein structures of 5 residue length, clustered from Phi-Psi angles

[Figure: a sequence of Phi-Psi angles …, (Φ1,Ψ1), (Φ2,Ψ2), …, (Φ8,Ψ8), … is mapped to a sequence of protein blocks … g l p d b b m …]

Summary: representations of 3-D protein structure

[Figure: the 3-D structure is reduced to Phi-Psi angles (Φ1,Ψ1), (Φ2,Ψ2), …, (Φ8,Ψ8), …, which are then discretized in two ways]

Conformational States: AAAGBBBBBBGEBBBB…

Protein Blocks: ammmalpppmmlmlbb…

Protein domains

• A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure.

Part 2: Prediction of protein backbone angles with structured output learning

Kuang, Leslie and Yang et al. (Bioinformatics, 2004)

Naïve window-based approach

Encode each position independently with sequence information within a length-k window.

[Figure: each position of the conformational-state sequence AAABBBBGGEBBBBB is paired with its window of profile values and passed to the SVM, for example:
A: -3 -4 -4 -4 -3 -4 …
A: 0 -1 -1 3 -4 3 4 1 …
B: 0 -1 2 1 -3 4 0 -1 …
B: -2 -3 -4 -5 -2 4 …
B: 0 -3 -1 -2 -4 -1 …
Callout from adjacent positions: "We are neighbors! We have dependency!"]

Predictions are independent.
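The window encoding described on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function name `window_features`, the zero padding at the sequence ends, and the toy two-column profile are all assumptions (real sequence profiles have one column per amino acid).

```python
from typing import List


def window_features(profile: List[List[float]], k: int) -> List[List[float]]:
    """Return one flattened length-k window of profile rows per position.

    Rows that fall outside the sequence are zero-padded. This padding is an
    assumption; the slide does not say how sequence ends are handled.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the window is centered")
    half = k // 2
    pad = [0.0] * len(profile[0])
    features = []
    for i in range(len(profile)):
        row: List[float] = []
        for j in range(i - half, i + half + 1):
            row.extend(profile[j] if 0 <= j < len(profile) else pad)
        features.append(row)
    return features


# Toy profile: 4 positions, 2 scores per position (hypothetical values).
toy_profile = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
feats = window_features(toy_profile, k=3)
print(feats[0])  # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
```

Each feature vector would then be classified on its own, which is why the predictions at neighboring positions are independent of each other.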

2-Stage window-based approach

• Take the predictions of the naïve window-based approach as input to a second set of SVMs.
• Ideally, this smoothing step can correct some wrong predictions.

[Figure: sequence profiles → window-based mapping → SVM (naïve window-based approach) → prediction profiles → window-based mapping → SVM (a second stage of smoothing) → final predictions]

Prediction profiles:
A    B    G    E
0.1  0.3  0.4  0.2
0.8  0.1  0.1  0.0
0.1  0.5  0.4  0.0
0.3  0.5  0.1  0.1
……

Predictions: AABGBEGE…
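The data flow of the second stage can be sketched with the four prediction-profile rows printed on this slide. Note the loud assumption: the second set of SVMs is replaced here by a simple window sum (`smooth`), purely as a stand-in to show how neighboring predictions can influence a position. The helper names and the zero padding are also hypothetical.

```python
STATES = "ABGE"

# Prediction profiles as printed on the slide (one row per position).
PREDICTION_PROFILE = [
    [0.1, 0.3, 0.4, 0.2],
    [0.8, 0.1, 0.1, 0.0],
    [0.1, 0.5, 0.4, 0.0],
    [0.3, 0.5, 0.1, 0.1],
]


def argmax_labels(profile):
    """Label each position with its highest-scoring state."""
    return "".join(STATES[row.index(max(row))] for row in profile)


def windows(profile, k):
    """Yield the k rows centered on each position; off-sequence rows are zeros."""
    half = k // 2
    pad = [0.0] * len(profile[0])
    for i in range(len(profile)):
        yield [profile[j] if 0 <= j < len(profile) else pad
               for j in range(i - half, i + half + 1)]


def smooth(profile, k):
    """Stand-in for the second-stage SVMs: sum the scores inside each window."""
    return [[sum(col) for col in zip(*win)] for win in windows(profile, k)]


first = argmax_labels(PREDICTION_PROFILE)
second = argmax_labels(smooth(PREDICTION_PROFILE, k=3))
print(first, second)  # GABB AAAB
```

In this toy run the isolated G at the first position is overruled by its neighbor. A trained second-stage SVM would learn such corrections from data instead of summing scores.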

Mohr and Obermayer et al. (NIPS, 2004)

Topographic SVM

• Training: train the SVM on sequence profiles plus true label profiles, combined by window-based mapping.
• Testing: feed the SVM sequence profiles plus prediction profiles, combined by window-based mapping, and iteratively update the predictions. The predictions for the first round come from a base approach.

True label profiles (training):
A    B    G    E
1.0  0.0  0.0  0.0
1.0  0.0  0.0  0.0
0.0  1.0  0.0  0.0
0.0  1.0  0.0  0.0
……

Prediction profiles (testing):
A    B    G    E
0.1  0.3  0.4  0.2
0.8  0.1  0.1  0.0
0.1  0.5  0.4  0.0
0.3  0.5  0.1  0.1
……

Tsochantaridis et al. (ICML, 2004)

Struct-SVM

• Testing: a pre-image problem.

• Training: build a joint feature mapping of input and label, and apply the large-margin principle to the difference between the feature mapping of the correct label and that of each wrong label.

• This is equivalent to the following optimization problem:
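The formula shown on the slide was not preserved. For orientation, the soft-margin problem in the cited Tsochantaridis et al. paper has the general form below. The symbols (weight vector w, slack variables ξ_i, trade-off constant C, joint feature map Ψ, n training pairs (x_i, y_i)) follow that paper's notation and are assumptions here; the slide may have shown a different variant.

```latex
\min_{\mathbf{w},\,\boldsymbol{\xi}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
\forall i,\; \forall y \in \mathcal{Y} \setminus \{y_i\}: \;
\langle \mathbf{w},\, \Psi(x_i, y_i) - \Psi(x_i, y) \rangle \ge 1 - \xi_i,
\qquad \xi_i \ge 0

% Testing (the pre-image problem): pick the highest-scoring label
f(x) = \arg\max_{y \in \mathcal{Y}} \; \langle \mathbf{w},\, \Psi(x, y) \rangle
```

There is one constraint per training example and wrong label, which is part of why training becomes expensive when the label space consists of whole sequences.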

Altun et al. (ICML 2003)

Pre-image for Labeling Sequences

• Hidden-Markov kernel

• Pre-image is equivalent to Viterbi decoding of an HMM built from support vectors
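Viterbi decoding itself can be sketched as below. This is a generic log-score version under stated assumptions: in the hidden-Markov SVM the per-position and transition scores would come from kernel expansions over support vectors, whereas here they are plain hypothetical tables, and the two-state toy example is invented for illustration.

```python
def viterbi(obs_scores, trans, start):
    """Return the highest-scoring state path.

    obs_scores[t][s]: score of state s at position t
    trans[p][s]:      score of moving from state p to state s
    start[s]:         score of starting in state s
    All scores are additive (log-domain).
    """
    n_states = len(start)
    best = [start[s] + obs_scores[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at position t
    for t in range(1, len(obs_scores)):
        prev, best, ptr = best, [], []
        for s in range(n_states):
            p = max(range(n_states), key=lambda q: prev[q] + trans[q][s])
            ptr.append(p)
            best.append(prev[p] + trans[p][s] + obs_scores[t][s])
        back.append(ptr)
    # Trace the best final state back to the start.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


STATES = "AB"
obs = [[0.0, -2.0], [-1.0, -0.5], [0.0, -2.0]]  # middle position weakly prefers B
trans = [[0.0, -1.0], [-1.0, 0.0]]              # switching states costs 1
start = [0.0, 0.0]

path = viterbi(obs, trans, start)
print("".join(STATES[s] for s in path))  # AAA
```

Position-by-position argmax would give ABA in this toy example; the transition scores make the jointly best path AAA, which is the kind of neighbor dependency the window-based approaches cannot express directly.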

Preliminary Results

• Prediction of Conformational States: 697 sequences of 97,365 amino acids with sequence identity < 25%

• Prediction of Protein Blocks: 675 sequences of 146,978 amino acids with sequence identity < 30%

Methods                          Accuracy (Conformational States)   Accuracy (Protein Blocks)
State of the art                 75.0%                              40.3%
Naïve window-based approach      75.0%                              57.7%
2-Stage window-based approach    >76.0%                             >59.5%
Topographic SVM                  75.3%                              58.4%
SVM for structured output        >70.0%                             >50%

Part 3: Protein domain detection based on protein structural classification

Murzin et al. (Journal of Molecular Biology, 1995)

Protein structural classification

[Figure: SCOP hierarchy (Fold, Superfamily, Family) with the positive training set, positive test set, negative training set and negative test set marked]

• Family: sequence identity > 30%, or functions and structures are very similar
• Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
• Common fold: same major secondary structures in the same arrangement with the same topological connections

Leslie et al. (PSB, 2002)

Spectrum kernel

• Feature map indexed by all possible k-length subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20

Q1: AKQDYYYYE
3-mers: AKQ KQD QDY DYY YYY YYY YYE

Q2: DYYEIAKQY
3-mers: DYY YYE YEI EIA IAK AKQ KQY

Feature Space (AAA-YYY):
k-mer   Q1   Q2
AKQ     1    1
DYY     1    1
EIA     0    1
IAK     0    1
KQD     1    0
KQY     0    1
QDY     1    0
YEI     0    1
YYE     1    1
YYY     2    0

K(Q1,Q2) = <(…1…1…0…0…1…0…1…0…1…2), (…1…1…1…1…0…1…0…1…1…0)> = 3
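The worked example on this slide can be checked with a few lines of Python. This is a minimal sketch of the k-mer counting view of the kernel with k = 3; the function names are illustrative, and the counting approach is a simple direct computation, not necessarily how the kernel was computed in the cited work.

```python
from collections import Counter


def spectrum(seq: str, k: int) -> Counter:
    """Count every contiguous k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x: str, y: str, k: int) -> int:
    """Inner product of the two k-mer count vectors."""
    fx, fy = spectrum(x, k), spectrum(y, k)
    # Counter returns 0 for k-mers that do not occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(spectrum_kernel("AKQDYYYYE", "DYYEIAKQY", k=3))  # 3
```

Only AKQ, DYY and YYE occur in both sequences, each contributing 1 × 1, which gives the value 3 shown on the slide.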

Kuang and Leslie et al. (JBCB, 2005)

Profile kernel

• Use the profile to define position-dependent mutation neighborhoods:

\[ P(x) = \{\, p_j(b) : b \in \Sigma,\ j = 1 \ldots |x| \,\} \]

\[ M_{(k,\sigma)}\big(P(x[j+1:j+k])\big) = \Big\{\, b_1 b_2 \ldots b_k : -\sum_{i=1}^{k} \log p_{j+i}(b_i) < \sigma \,\Big\} \]

• E.g. k=3, σ=5 and a profile of negative log probabilities for the window AKQ:

     A   K   Q   …
A    1   3   4   …
C    5   4   1   …
D    4   4   4   …
…    …   …   …   …
K    4   1   4   …
…    …   …   …   …
Q    3   4   1   …
…    …   …   …   …
Y    2   4   3   …

Mutation neighborhood of AKQ:
AKQ (1+1+1 < σ)
AKC (1+1+1 < σ)
YKQ (2+1+1 < σ)
YKC (2+1+1 < σ)

Feature vector:
( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )
          AKC     AKQ     YKC     YKQ
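The neighborhood for the window AKQ can be enumerated directly from the profile values on this slide. The sketch below is a brute-force illustration under one explicit assumption: only the six amino-acid rows printed on the slide (A, C, D, K, Q, Y) are used, because the remaining rows are elided. The names `NEG_LOG_P` and `mutation_neighborhood` are hypothetical.

```python
from itertools import product

# -log p_j(b) for the three positions of the window "AKQ".
# Only the rows shown on the slide; the elided rows are left out (assumption).
NEG_LOG_P = {
    "A": (1, 3, 4),
    "C": (5, 4, 1),
    "D": (4, 4, 4),
    "K": (4, 1, 4),
    "Q": (3, 4, 1),
    "Y": (2, 4, 3),
}


def mutation_neighborhood(neg_log_p, k, sigma):
    """All k-mers whose summed negative log probability is below sigma."""
    letters = sorted(neg_log_p)
    return sorted(
        "".join(kmer)
        for kmer in product(letters, repeat=k)
        if sum(neg_log_p[b][i] for i, b in enumerate(kmer)) < sigma
    )


print(mutation_neighborhood(NEG_LOG_P, k=3, sigma=5))
# ['AKC', 'AKQ', 'YKC', 'YKQ']
```

Enumerating all k-mers is exponential in k and is only meant to make the definition concrete; each k-mer in the neighborhood sets the corresponding coordinate of the feature vector.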

Positional classification scores

[Figure: positional classification scores for two example proteins, PDB ID: 1f2e and PDB ID: 1hnf]

A simple probabilistic model to detect domains:

\[
\begin{aligned}
P(S,E \mid F) ={}& P(s_0,\, s_1 - 1 \mid F) \cdot P(s_1,\, e_1 \mid F) \cdot P(e_1 + 1,\, s_2 - 1 \mid F) \\
& \cdot P(s_2,\, e_2 \mid F) \cdot P(e_2 + 1,\, s_3 - 1 \mid F) \cdots P(s_n,\, e_n \mid F) \cdot P(e_n + 1,\, |F| \mid F)
\end{aligned}
\]

Experiments

1. Dataset

• 7,329 sequences from SCOP 1.59.

• Sequence identity less than 95%.

2. Preliminary Results (with a simplified model)

Criteria           Accuracy   Coverage
Domain positions   73.2%      73.1%
Domain start       51.1%      36.0%
Domain end         31.1%      21.9%

Part 4: Discussion

•Dependency between conformational states or protein blocks does not help much in the 2-stage window-based approach.

•Struct-SVM does not scale very well for large problems. Perceptron training may speed up the training stage.

•A proper probabilistic model is needed for detecting domain boundaries from positional classification scores

Acknowledgement

• William Stafford Noble, Genome Science Department, University of Washington
• Asa Ben-Hur, Genome Science Department, University of Washington
• An-Suei Yang, Genome Research Center, Academia Sinica of Taiwan
• Yasemin Altun, Toyota Technological Institute at Chicago
• Thorsten Joachims, Computer Science Department, Cornell University