Protein tertiary structure prediction Protein tertiary structure prediction with new machine learning approacheswith new machine learning approaches
RuiRui KuangKuang
Department of Computer ScienceDepartment of Computer ScienceColumbia UniversityColumbia University
Supervisor: Jason Weston(NEC) and Christina Leslie(Columbia) Supervisor: Jason Weston(NEC) and Christina Leslie(Columbia)
NEC summer internship talk, August 30th, 2005NEC summer internship talk, August 30th, 2005
AgendaAgenda
1. Introduction to protein structure2. Protein backbone angle prediction
with structured output learning3. Protein domain detection based on
protein structural classification4. Discussion
1. Protein structure 2. Backbone angle prediction3. Domain detection4. Discussion
Part 1: Protein structurePart 1: Protein structure• Protein –
Derived from Greek word proteios meaning “of the first rank” in 1838 by Jöns J. Berzelius
• Crucial in all biological processes
• Function depends on structure (structure can help us to understand function)
• Determination of protein structures is time consuming and expensive
How to describe protein How to describe protein structurestructure
• Primary structure: amino acid sequence• Secondary structure: local structure elements • Tertiary structure: packing and arrangement of
secondary structure, also called domain• Quaternary structure: arrangement of several
polypeptide chains
Describe protein Describe protein tertiary structure structure by protein backbone anglesby protein backbone angles
Phi-Psi Angles
……
(Φ1,Ψ1)
(Φ2,Ψ2)
(Φ3,Ψ3)
(Φ4,Ψ4)
(Φ5,Ψ5)
(Φ6,Ψ6)
(Φ7,Ψ7)
(Φ8,Ψ8)
……
Simplify
3-D structure
(Too complicated to predict!)
Oliver et al.(Journal of Molecular Biology, 1997)
DiscretizationDiscretization of Phiof Phi--PsiPsi angles:angles:conformational statesconformational states
De Brevern et al.(Protein Science, 2002)
Protein blocksProtein blocks16 small prototypes (a-p) of local protein structures of 5 residue length, clustered from Phi-Psi angles
Phi-Psi Angles
……
(Φ1,Ψ1)
(Φ2,Ψ2)
(Φ3,Ψ3)
(Φ4,Ψ4)
(Φ5,Ψ5)
(Φ6,Ψ6)
(Φ7,Ψ7)
(Φ8,Ψ8)
……
Protein Blocks
……
g
l
p
d
b
b
m
……
Summary: representations Summary: representations of 3of 3--D protein structure D protein structure
Phi-Psi Angles
(Φ1,Ψ1)
(Φ2,Ψ2)
(Φ3,Ψ3)
(Φ4,Ψ4)
(Φ5,Ψ5)
(Φ6,Ψ6)
(Φ7,Ψ7)
(Φ8,Ψ8)
……
3-D structure Conformational States:
AAAGBBBBBBGEBBBB…
Protein Blocks:
ammmalpppmmlmlbb…
Protein domainsProtein domains
• A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure.
1. Protein structure 2. Backbone angle prediction3. Domain detection4. Discussion
Part 2: Part 2: Prediction of protein backbone angle
with structured output learning
Kuang, Leslie and Yang et al.(Bioinformatics, 2004)
NaNaïïve windowve window--based approachbased approachEncode each position independently with sequence information within a length-k window.
Conformational
States
A
A
A
B
B
B
B
G
G
E
B
B
B
B
B
A:-3 –4 –4 –4 –3 –4…..
A:0 –1 –1 3 –4 3 4 1…..
B:0 –1 2 1 –3 4 0 –1……
B:-2 –3 –4 –5 –2 4……
B:0 –3 –1 –2 –4 –1……
……
ToSVM
Predictions are independent.
We are neighbors! We have
dependency!
22--Stage windowStage window--based approachbased approach• Take the prediction of the naïve window-based
approach as input to a second sets of SVMs. • Ideally this smoothing step can correct some
wrong predictions.
Window-basedmapping
Naïve window-based approach
Sequence profiles SVM SVMA B G E
0.1 0.3 0.4 0.2
0.8 0.1 0.1 0.0
0.1 0.5 0.4 0.0
0.3 0.5 0.1 0.1……
Prediction profiles
Window-basedmapping
Final Predictions
AABGBEGE………
Predictions
A second stage of smoothing
Mohr and Obermayer et al. (NIPS ,2004)
Topographic SVMTopographic SVM
Sequence profiles
A B G E
1.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0……
True Label profiles
+ Window-basedmapping
SVM
Training
Update predictions
A B G E
0.1 0.3 0.4 0.2
0.8 0.1 0.1 0.0
0.1 0.5 0.4 0.0
0.3 0.5 0.1 0.1……
Prediction profiles
Sequence profiles
+Window-based
mapping
Testing
From a base
approach for
the first round
• Training with profiles+true labels• iteratively update the predictions in the testing phase.
Tsochantaridis(ICML, 2004)
StructStruct--SVMSVM
•Testing: a pre-image problem
•Training: make joint feature mapping and apply large margin principle for the difference between the feature mapping of correct label and of wrong label.
•This is equivalent to the following optimization problem
Altun et al. (ICML 2003)
PrePre--image for Labeling image for Labeling SequencesSequences
• Hidden-Markov kernel
• Pre-image is equivalent to Viterbi-decoding of a HMM built from support vectors
Preliminary ResultsPreliminary Results•Prediction of Conformational States: 697 sequences of 97,365 amino acids with sequence identity < 25 %
•Prediction of Protein Blocks: 675 sequences of 146,978 amino acids with sequence identity < 30 %
50%>70.0%>SVM for structured output
58.4%75.3%Topographic SVM
59.5%>76.0%>2-Stage window-based approach
57.7%75.0%Naïve window-based approach
40.3%75.0%State of art
Accuracy(Protein Blocks)
Accuracy(Conformational States)
Methods
1. Introduction 2. Backbone angle prediction3. Domain detection4. Discussion
Part 3: Part 3: Protein domain detection based on Protein domain detection based on
protein structural classificationprotein structural classification
Murzin et al. (Journal of Molecular Biology, 1995)
Protein structural classificationProtein structural classification
SCOP
Fold
Superfamily
FamilyPositive Training Set
Positive Test Set
Negative Training Set
Negative Test Set
Family : Sequence identity > 30% or functions and structures are very similarSuperfamily : low sequence similarity but functional features suggest probable common evolutionary originCommon fold : same major secondary structures in the same arrangement with the same topological connections
Leslie et al. (PSB, 2002)
Spectrum kernelSpectrum kernel• Feature map indexed by all possible k-length
subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20
Q1:AKQDYYYYE
AKQKQDQDYDYYYYYYYYYYE
DYYYYEYEIEIAIAKAKQKQY
Feature Space(AAA-YYY)1 AKQ 11 DYY 1 0 EIA 10 IAK 11 KQD 00 KQY 1 1 QDY 0 0 YEI 11 YYE 12 YYY 0
Q2:DYYEIAKQY
K(Q1,Q2)=<(…1…1…0…0…1…0…1…0…1…2),(…1…1…1…1…0…1…0…1…1…0)>=3
Kuang and Leslie et al.(JBCB, 2005)
Profile kernelProfile kernel• Use profile to define position-dependent mutation neighborhoods:
• E.g. k=3, σ=5 and a profile of negative log probabilities
P(x) = p j (b),b∈ Σ, j =1K x{ }
AKQYKQ(2+1+1<σ) AKQ
(1+1+1<σ)AKC
(1+1+2<σ)
YKC(2+1+1<σ)
( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )AKC AKQ YKC YKQ
AKQ
A K Q …A 1 3 4 …C 5 4 1 …D 4 4 4 …… … … … …K 4 1 4 …… … … … …Q 3 4 1 …… … … … …Y 2 4 3 …
M k ,σ( ) P x j + 1: j + k[ ]( )( )=b1b2Lbk :− log p j + i bi( )( )i∑ <σ{ }
Positional classification scoresPositional classification scores
PDB ID: 1f2e PDB ID:1hnf
A simple probabilistic model to detect domains:
)|||,1(*)|,(*...*)|1,1(*)|,(
*)|1,1(*)|,(*)|1,()|,(
3222
211110
FFePFesPFSePFesP
FsePFesPFssPFESP
nnn +−+
−+−=
ExperimentsExperiments1.Dataset
• 7,329 sequences from SCOP 1.59.
• Sequence identity less than 95%.
2.Preliminary Results (with a simplified model)
21.9%31.1%Domain end
36.0%51.1%Domain start
73.1%73.2%Domain positions
CoverageAccuracyCriteria
1. Introduction 2. Backbone angle prediction3. Domain detection4. Discussion
Part 4: DiscussionPart 4: Discussion
•Dependency between conformational states or protein blocks does not help much in the 2-stage window-based approach.
•Struct-SVM does not scale very well for large problems. Perceptron training may speed up the training stage.
•A proper probabilistic model is needed for detecting domain boundaries from positional classification scores
AcknowledgementAcknowledgement•William Stafford Noble
Genome Science Department, University of Washington•Asa Ben-Hur
Genome Science Department, University of Washington
•An-Suei YangGenome Research Center, Academia Sinica of Taiwan
•Yasemin AltunToyota Technological Institute at Chicago
•Thorsten JoachimsComputer Science Department, Cornell University