Download pdf - Protein tertiary structure prediction with new machine learning …rkuang/paper/intern.pdf · 2005. 9. 30. · Protein tertiary structure prediction with new machine learning approaches

Protein tertiary structure prediction Protein tertiary structure prediction with new machine learning approacheswith new machine learning approaches

RuiRui KuangKuang

Department of Computer ScienceDepartment of Computer ScienceColumbia UniversityColumbia University

Supervisor: Jason Weston(NEC) and Christina Leslie(Columbia) Supervisor: Jason Weston(NEC) and Christina Leslie(Columbia)

NEC summer internship talk, August 30th, 2005NEC summer internship talk, August 30th, 2005

AgendaAgenda

1. Introduction to protein structure2. Protein backbone angle prediction

with structured output learning3. Protein domain detection based on

protein structural classification4. Discussion

1. Protein structure 2. Backbone angle prediction3. Domain detection4. Discussion

Part 1: Protein structurePart 1: Protein structure• Protein –

Derived from Greek word proteios meaning “of the first rank” in 1838 by Jöns J. Berzelius

• Crucial in all biological processes

• Function depends on structure (structure can help us to understand function)

• Determination of protein structures is time consuming and expensive

How to describe protein How to describe protein structurestructure

• Primary structure: amino acid sequence• Secondary structure: local structure elements • Tertiary structure: packing and arrangement of

secondary structure, also called domain• Quaternary structure: arrangement of several

polypeptide chains

Describe protein Describe protein tertiary structure structure by protein backbone anglesby protein backbone angles

Phi-Psi Angles

……

(Φ1,Ψ1)

(Φ2,Ψ2)

(Φ3,Ψ3)

(Φ4,Ψ4)

(Φ5,Ψ5)

(Φ6,Ψ6)

(Φ7,Ψ7)

(Φ8,Ψ8)

……

Simplify

3-D structure

(Too complicated to predict!)

Oliver et al.(Journal of Molecular Biology, 1997)

DiscretizationDiscretization of Phiof Phi--PsiPsi angles:angles:conformational statesconformational states

De Brevern et al.(Protein Science, 2002)

Protein blocksProtein blocks16 small prototypes (a-p) of local protein structures of 5 residue length, clustered from Phi-Psi angles

Phi-Psi Angles

……

(Φ1,Ψ1)

(Φ2,Ψ2)

(Φ3,Ψ3)

(Φ4,Ψ4)

(Φ5,Ψ5)

(Φ6,Ψ6)

(Φ7,Ψ7)

(Φ8,Ψ8)

……

Protein Blocks

……

g

l

p

d

b

b

m

……

Summary: representations Summary: representations of 3of 3--D protein structure D protein structure

Phi-Psi Angles

(Φ1,Ψ1)

(Φ2,Ψ2)

(Φ3,Ψ3)

(Φ4,Ψ4)

(Φ5,Ψ5)

(Φ6,Ψ6)

(Φ7,Ψ7)

(Φ8,Ψ8)

……

3-D structure Conformational States:

AAAGBBBBBBGEBBBB…

Protein Blocks:

ammmalpppmmlmlbb…

Protein domainsProtein domains

• A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure.

1. Protein structure 2. Backbone angle prediction3. Domain detection4. Discussion

Part 2: Part 2: Prediction of protein backbone angle

with structured output learning

Kuang, Leslie and Yang et al.(Bioinformatics, 2004)

NaNaïïve windowve window--based approachbased approachEncode each position independently with sequence information within a length-k window.

Conformational

States

A

A

A

B

B

B

B

G

G

E

B

B

B

B

B

A:-3 –4 –4 –4 –3 –4…..

A:0 –1 –1 3 –4 3 4 1…..

B:0 –1 2 1 –3 4 0 –1……

B:-2 –3 –4 –5 –2 4……

B:0 –3 –1 –2 –4 –1……

……

ToSVM

Predictions are independent.

We are neighbors! We have

dependency!

22--Stage windowStage window--based approachbased approach• Take the prediction of the naïve window-based

approach as input to a second sets of SVMs. • Ideally this smoothing step can correct some

wrong predictions.

Window-basedmapping

Naïve window-based approach

Sequence profiles SVM SVMA B G E

0.1 0.3 0.4 0.2

0.8 0.1 0.1 0.0

0.1 0.5 0.4 0.0

0.3 0.5 0.1 0.1……

Prediction profiles

Window-basedmapping

Final Predictions

AABGBEGE………

Predictions

A second stage of smoothing

Mohr and Obermayer et al. (NIPS ,2004)

Topographic SVMTopographic SVM

Sequence profiles

A B G E

1.0 0.0 0.0 0.0

1.0 0.0 0.0 0.0

0.0 1.0 0.0 0.0

0.0 1.0 0.0 0.0……

True Label profiles

+ Window-basedmapping

SVM

Training

Update predictions

A B G E

0.1 0.3 0.4 0.2

0.8 0.1 0.1 0.0

0.1 0.5 0.4 0.0

0.3 0.5 0.1 0.1……

Prediction profiles

Sequence profiles

+Window-based

mapping

Testing

From a base

approach for

the first round

• Training with profiles+true labels• iteratively update the predictions in the testing phase.

Tsochantaridis(ICML, 2004)

StructStruct--SVMSVM

•Testing: a pre-image problem

•Training: make joint feature mapping and apply large margin principle for the difference between the feature mapping of correct label and of wrong label.

•This is equivalent to the following optimization problem

Altun et al. (ICML 2003)

PrePre--image for Labeling image for Labeling SequencesSequences

• Hidden-Markov kernel

• Pre-image is equivalent to Viterbi-decoding of a HMM built from support vectors

Preliminary ResultsPreliminary Results•Prediction of Conformational States: 697 sequences of 97,365 amino acids with sequence identity < 25 %

•Prediction of Protein Blocks: 675 sequences of 146,978 amino acids with sequence identity < 30 %

50%>70.0%>SVM for structured output

58.4%75.3%Topographic SVM

59.5%>76.0%>2-Stage window-based approach

57.7%75.0%Naïve window-based approach

40.3%75.0%State of art

Accuracy(Protein Blocks)

Accuracy(Conformational States)

Methods

1. Introduction 2. Backbone angle prediction3. Domain detection4. Discussion

Part 3: Part 3: Protein domain detection based on Protein domain detection based on

protein structural classificationprotein structural classification

Murzin et al. (Journal of Molecular Biology, 1995)

Protein structural classificationProtein structural classification

SCOP

Fold

Superfamily

FamilyPositive Training Set

Positive Test Set

Negative Training Set

Negative Test Set

Family : Sequence identity > 30% or functions and structures are very similarSuperfamily : low sequence similarity but functional features suggest probable common evolutionary originCommon fold : same major secondary structures in the same arrangement with the same topological connections

Leslie et al. (PSB, 2002)

Spectrum kernelSpectrum kernel• Feature map indexed by all possible k-length

subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20

Q1:AKQDYYYYE

AKQKQDQDYDYYYYYYYYYYE

DYYYYEYEIEIAIAKAKQKQY

Feature Space(AAA-YYY)1 AKQ 11 DYY 1 0 EIA 10 IAK 11 KQD 00 KQY 1 1 QDY 0 0 YEI 11 YYE 12 YYY 0

Q2:DYYEIAKQY

K(Q1,Q2)=<(…1…1…0…0…1…0…1…0…1…2),(…1…1…1…1…0…1…0…1…1…0)>=3

Kuang and Leslie et al.(JBCB, 2005)

Profile kernelProfile kernel• Use profile to define position-dependent mutation neighborhoods:

• E.g. k=3, σ=5 and a profile of negative log probabilities

P(x) = p j (b),b∈ Σ, j =1K x{ }

AKQYKQ(2+1+1<σ) AKQ

(1+1+1<σ)AKC

(1+1+2<σ)

YKC(2+1+1<σ)

( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )AKC AKQ YKC YKQ

AKQ

A K Q …A 1 3 4 …C 5 4 1 …D 4 4 4 …… … … … …K 4 1 4 …… … … … …Q 3 4 1 …… … … … …Y 2 4 3 …

M k ,σ( ) P x j + 1: j + k[ ]( )( )=b1b2Lbk :− log p j + i bi( )( )i∑ <σ{ }

Positional classification scoresPositional classification scores

PDB ID: 1f2e PDB ID:1hnf

A simple probabilistic model to detect domains:

)|||,1(*)|,(*...*)|1,1(*)|,(

*)|1,1(*)|,(*)|1,()|,(

3222

211110

FFePFesPFSePFesP

FsePFesPFssPFESP

nnn +−+

−+−=

ExperimentsExperiments1.Dataset

• 7,329 sequences from SCOP 1.59.

• Sequence identity less than 95%.

2.Preliminary Results (with a simplified model)

21.9%31.1%Domain end

36.0%51.1%Domain start

73.1%73.2%Domain positions

CoverageAccuracyCriteria

1. Introduction 2. Backbone angle prediction3. Domain detection4. Discussion

Part 4: DiscussionPart 4: Discussion

•Dependency between conformational states or protein blocks does not help much in the 2-stage window-based approach.

•Struct-SVM does not scale very well for large problems. Perceptron training may speed up the training stage.

•A proper probabilistic model is needed for detecting domain boundaries from positional classification scores

AcknowledgementAcknowledgement•William Stafford Noble

Genome Science Department, University of Washington•Asa Ben-Hur

Genome Science Department, University of Washington

•An-Suei YangGenome Research Center, Academia Sinica of Taiwan

•Yasemin AltunToyota Technological Institute at Chicago

•Thorsten JoachimsComputer Science Department, Cornell University