Consensus Fold Recognition Methods
Dongbo Bu
School of Computer Science
University of Waterloo
Joint work with S.C. Li, X. Gao, L. Yu, J. Xu, M. Li
Nov. 2006
Outline
• Background
• Consensus Prediction Methods
• ACE7: consensus method by identifying latent servers
• Experimental Results
• Future Work
Background
From sequence to structure
• The Rate Gap (motivation)
– gene prediction is fast,
– but experimental structure determination is slow
• The First Principle (possibility)
– Sequence almost determines structure
• CASP Competition (benchmark)
– A fair and objective examination
Computational Methods
• Homology Modeling: sequence-sequence alignment
• Threading: sequence-structure alignment
• Ab initio: database independent
Why Consensus?
• Observation:
– no single server can reliably predict the best models for all the targets.
– a particular structure prediction server may perform well on some targets, but badly on others.
• A natural idea to solve this issue:
– to combine the strengths of different prediction methods to obtain better structural models.
What is Consensus Method?
Formal Description
• Notations:
– Target: the query protein sequence
– Server: implementation of a prediction method
– Model: a predicted structure
Classical Consensus Methods
Research History
• Early exploration of the consensus idea:
– Combining many methods in one server.
– INBGU (SHGU) D. Fischer 2000
– 3D-PSSM (Phyre) L. Kelley 2000
• The first consensus server
– CAFASP-CONSENS: D. Fischer 2001
• Successors:
– Pcons/Pmodeller J. Lundstrom, A. Elofsson 2001
– 3D-Jury K. Ginalski, A. Elofsson 2003
– 3D-Shotgun D. Fischer 2003
– ACE L. Yu, J. Xu, M. Li 2004
Three-step Process
• Step 1: Model Comparison
– determine model similarities
• Step 2: Feature Extraction
– formal description of a model
• Step 3: Model Selection
– select a model, or part of it.
• Many machine learning techniques were introduced in the 3rd step.
3D-Shotgun: Majority Voting
• Basic Idea:
– Reminiscent of “cooperative algorithms”
• Five Input Servers:
– GONP, GONPM, PRFSEQ, SEQPPRF, SEQPMPRF
• Step 1. Model Comparison
– For each initial model, find the models with LOCAL similarity to it.
3D-Shotgun (cont)
• Step 2. Feature Extraction
– For each model M, superimpose similar models upon M,
– using the shared similarity to compute the transformation.
– Build a multiple structure alignment A(M) as a result.
– Feature:
• the number of models that share a structural element with A(M).
3D-Shotgun (cont)
• Step 3. Selection
– Majority Voting
– Choose the structural element with the highest count.
– The underlying rationale:
• The recurring structural elements are most likely to be correct.
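The voting rule above can be sketched as follows. This is a minimal illustration, not the 3D-Shotgun implementation: the function name `majority_vote`, the element labels, and the assignment of elements to servers are all hypothetical, and each model is reduced to a plain set of structural-element labels instead of a real structure alignment.

```python
from collections import Counter


def majority_vote(elements_by_model):
    """Return the most frequently shared element, its count, and all counts.

    elements_by_model maps a model name to the set of (hypothetical)
    structural-element labels found in that model.
    """
    counts = Counter()
    for elements in elements_by_model.values():
        counts.update(set(elements))  # each model votes once per element
    # Break ties by label so the result is deterministic.
    element, count = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return element, count, counts


# Toy input: labels are made up for illustration only.
models = {
    "GONP": {"helix_A", "strand_B"},
    "GONPM": {"helix_A", "strand_C"},
    "PRFSEQ": {"helix_A", "strand_B"},
}
best, votes, counts = majority_vote(models)
print(best, votes)
```

In this toy input the element shared by all three models wins the vote, which mirrors the rationale that recurring elements are the most likely to be correct.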
Confidence Assignment
• For each assembled model M’, a confidence score S’ is given as follows:
• Here,
– k,l run over all the input models
– S_{k,l} is the confidence score given by the individual server
– Sim() adopts MaxSub.
Performance of 3D-Shotgun
CAFASP-Consensus and Pcons: Neural Network
• Step 1. Model Comparison
– CAFASP-Consensus: check SCOP id, or run MaxSub
– Pcons: LGScore2 to detect similarity
• Step 2: Feature Extraction
– CAFASP-Consensus: number of similar models
– Pcons:
• ratio of the similar models
• weighted f1
• ratio of the similar 1st model
CAFASP-Consensus and Pcons: (cont)
• Step 3. Model Selection
– Formulated into a machine learning problem
– Attribute:
• Log(LGScore2), significantly better than LGScore2.
Pmodeller = Pcons + ProQ
• ProQ:
– a neural network package to measure the quality of a structure
• Pmodeller has an advantage over Pcons because a number of high-score but false-positive models are eliminated.
Performance of Pcons/Pmod
ACE: SVM Regression
• Step 1. Model Comparison
– MaxSub
• Step 2. Feature Extraction
– f1: the normalized similarity with all the other models
– f2: the normalized similarity with the most similar one
– f3: for each target, a measure of the divergence of the server predictions.
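As a rough illustration of f1 and f2, the sketch below assumes the pairwise similarities are already normalized to [0, 1] and takes f1 as the mean similarity to all other models and f2 as the maximum. Both choices, the function name `consensus_features`, and the toy matrix are assumptions of this sketch; the normalization actually used by ACE is not specified here, and f3 is left out.

```python
def consensus_features(sim, i):
    """Compute (f1, f2) for model i from a symmetric similarity matrix.

    Assumption: sim[i][j] is a normalized similarity in [0, 1].
    f1 = mean similarity to every other model,
    f2 = similarity to the single most similar other model.
    """
    others = [sim[i][j] for j in range(len(sim)) if j != i]
    f1 = sum(others) / len(others)
    f2 = max(others)
    return f1, f2


# Toy similarity matrix for three models (values are made up).
sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
features = [consensus_features(sim, i) for i in range(len(sim))]
for name, (f1, f2) in zip(["model_0", "model_1", "model_2"], features):
    print(name, round(f1, 3), round(f2, 3))
```

Feature vectors of this kind are what a regression step would take as input when predicting model quality.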
ACE (cont)
• Step 3: Selection
– SVM Regression: to predict the model quality
– Attribute:
• MaxSub with the native structure
Performance of ACE
• In CASP6, ACE was ranked 2nd among 87 automatic servers.
• On LiveBench test set:
Other techniques
• 3D-Jury:
– Rationale: the average of the low-energy conformations is similar to the native structure.
– Basic Idea: Mimic the averaging step by the following scoring function:
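One way to sketch an averaging-style consensus score of this kind is shown below. It is an illustration under stated assumptions, not the scoring function from the slide: the similarity cutoff of 40, the normalization by 1 + N, the function name `jury_scores`, and the toy matrix are all assumptions of this sketch.

```python
def jury_scores(sim, threshold=40.0):
    """Score each model by its thresholded similarity to all other models.

    Assumptions of this sketch: sim[i][j] is a pairwise similarity score,
    pairs at or below `threshold` are ignored, and the sum is divided by
    1 + (number of retained pairs).
    """
    n = len(sim)
    scores = []
    for i in range(n):
        kept = [sim[i][j] for j in range(n) if j != i and sim[i][j] > threshold]
        scores.append(sum(kept) / (1 + len(kept)))
    return scores


# Toy similarity matrix: three mutually similar models and one outlier.
sim = [
    [0, 90, 80, 10],
    [90, 0, 70, 20],
    [80, 70, 0, 30],
    [10, 20, 30, 0],
]
scores = jury_scores(sim)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 2) for s in scores])
```

The model closest to the "centre" of the cluster receives the highest score, while the outlier, which has no similar partner above the cutoff, scores zero.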
Other techniques (cont)
• Robetta:
– For each fragment, choose a local structure from a set, and assemble them to minimize an energy function.
• BPROMPT:
– Bayesian Belief Network
• JPred:
– Decision Tree
CASP7 Performance
ACE7: A Consensus Method by Identifying Latent Servers
Motivation
• Server Correlation:
– Although consensus servers assume that each individual server is independent of the others, it is observed from CASP6 results that correlation exists between different servers to some degree.
• Negative Effect:
– this kind of correlation sometimes makes a native-like model receive less support than the incorrect models.
Examination of ACE on CASP6 Dataset
• Observation:
– If a native-like model receives support from only 1 or 2 servers, it is difficult to select it.
Source of Server Correlation
• Server Correlation:
– some servers tend to generate similar results.
• Reason:
– Roughly speaking, the correlations arise from the fact that these servers adopt similar techniques, including sequence alignment tools, secondary structure prediction methods, scoring functions, etc.
• Latent Servers:
– Here, we use independent latent servers to represent the common features shared by these implicit servers.
ACE7: to reduce the server correlation
• Step 1. Adopting Maximum Likelihood to estimate the server correlation.
• Step 2. Employing the Principal Component Analysis technique to derive the latent servers.
• Step 3. Using an ILP model to weigh the latent servers.
Two Assumptions of ACE7
• Assumption 1:
– Here, we approximate Ci,m by:
• Assumption 2:
Maximum Likelihood Estimation of Server Correlation
Here,
Server Correlation
• Observation:
– The server correlation is significant, considering that there are thousands of candidate models.
– some servers are correlated more tightly than others.
• mGenThreader and RAPTOR (0.383) vs. FUGUE3 and Prospect (0.182).
• Implication:
– These individual servers may be clustered into cliques according to their correlations;
– the servers in a small clique may be underestimated under the simple “majority voting” rule.
Uncovering the Latent Server
Uncovering the Latent Servers (cont)
• Using the PCA technique, the latent servers can be estimated as:
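A minimal sketch of the PCA step in general terms, not the authors' estimator: principal components of a small server correlation matrix are extracted by power iteration with deflation. The correlation values, the function names `power_iteration` and `principal_components`, and the choice of power iteration (standing in for a library eigen-solver) are all assumptions of this sketch.

```python
import math


def power_iteration(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    v = [1.0 + 0.1 * k for k in range(n)]  # non-uniform starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[r] * sum(matrix[r][c] * v[c] for c in range(n)) for r in range(n)
    )
    return eigenvalue, v


def principal_components(matrix, k):
    """Top-k (eigenvalue, eigenvector) pairs via repeated deflation."""
    work = [row[:] for row in matrix]
    n = len(work)
    components = []
    for _ in range(k):
        lam, vec = power_iteration(work)
        components.append((lam, vec))
        for r in range(n):
            for c in range(n):
                work[r][c] -= lam * vec[r] * vec[c]  # remove found component
    return components


# Made-up correlation matrix: servers 0 and 1 are correlated, server 2 is not.
corr = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
components = principal_components(corr, 2)
for lam, vec in components:
    print(round(lam, 3), [round(x, 3) for x in vec])
```

In this toy case the first component loads equally on the two correlated servers and the second isolates the independent one, which is the kind of grouping a latent server is meant to capture.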
Explanation of Latent Servers
• Observation:
– H1: represents MGTH and RAPT
– H2: SPKS
– H3: FUG3
– H4: ST02
– H5: PROS
– H6: no preference
Construct a More Accurate Server
• Since latent servers are mutually independent, it is reasonable to assume:
• Key Point:
– How to set the weight of each latent server?
– An ILP model:
• To maximize the gap between the scores of the native-like models and the incorrect models.
ILP Model (soft-margin idea)
Experiment on CASP7 Dataset
• Observation:
– For T0363, ACE7 succeeds even when only one server votes for the native-like model.
Sensitivity of ACE7
• Observation:
– ACE7 has a higher sensitivity than any individual server.
Future Work
Conclusion
• Though consensus methods rely on the structure clustering property, server correlation also has a negative effect.
Future Work
• To find a better approximation of Ci,m.
• Using MaxSub instead of GDT.
• RAPTOR performs well at choosing the top 5 models, but has difficulty choosing the top 1 model.
• Helping to choose the best of the top 5 models remains an open problem.
Thanks.