Protein homology detection by HMM–HMM comparison
Johannes Söding
A topic in Sequence analysis
Presented by: Giriprasad Sridhara [email protected]
CISC 841 Spring 2006
APR 20 2006
Organization of presentation
• Introduction
• Theory
• Results
• Conclusion
Introduction
• Paper Details:
  – Bioinformatics journal
  – Vol. 21 no. 7 2005, pages 951–960
• Author Details
  – Dr. Johannes Söding
    • Department of Protein Evolution,
      Max Planck Institute for Developmental Biology,
      Spemannstrasse 35, D-72076
      Tübingen, Germany
Introduction
• Tool Details:
  – A tool, HHPred, has been developed.
  – Described in Nucleic Acids Research, 2005, Vol 33
  – A web server is available at
    http://www.protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred
Introduction
• Central theme in bioinformatics:
  – Homology and sequence alignment
• Issues:
  – The problem is to find a close homolog of known function or structure, which allows inferences to be made about the protein under study.
  – New and highly sensitive methods can detect and align remotely homologous sequences that provide information about the protein’s function, structure or evolution.
Introduction
• Methods (tools) of homology detection, in increasing order of sensitivity:
  – Sequence – Sequence
    • BLAST
    • FASTA
  – Profile – Sequence
    • PSI-BLAST
      – More sensitive since it uses a sequence profile
  – Profile – Profile
    • COMPASS
    • PROF_SIM
  – HMM – Sequence
    • HMMER
  – HMM – HMM
    • HHPred
Introduction
• Sequence profiles
  – Built from a multiple alignment of homologous sequences
  – Contain more information about the sequence family than a single sequence.
  – Help to distinguish between conserved and non-conserved positions
    • Conserved positions are important for defining members of the family
    • Non-conserved positions are variable among the members of the family.
  – Describe exactly what variation in amino acids is possible at each position
    • Done by recording the probability of occurrence of each amino acid in each column of the multiple alignment.
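As a minimal sketch of the idea on this slide, the snippet below turns a toy multiple alignment into a sequence profile by recording per-column amino acid probabilities. The function name `build_profile`, the optional `pseudocount` parameter and the toy alignment are illustrative assumptions, not part of the paper or of any tool; real profile builders also apply sequence weighting, which is omitted here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=0.0):
    """Return one {amino acid: probability} dict per alignment column.

    Gap characters are skipped. With pseudocount=0 a column made only of
    gaps would divide by zero, so pass a small pseudocount for such data.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile

# Toy alignment (hypothetical): columns 1 and 2 are conserved, 3 and 4 vary.
msa = ["ACDA", "ACEA", "ACDG", "AC-G"]
profile = build_profile(msa)
```

In this toy case the first column is fully conserved (probability 1 for A), while the third column splits between D and E, which is the kind of position-specific information a single sequence cannot carry.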
Introduction
• Profile Hidden Markov Models (Profile HMMs)
  – Similar to simple sequence profiles
  – Have amino acid frequencies as in the columns of an MSA
  – Also have position-specific probabilities for inserts and deletions along the alignment
  – Logarithms of these probabilities = position-specific gap penalties
  – Perform better than sequence profiles in the detection of homologs and in the quality of alignments
  – Why higher sensitivity?
    • Position-specific gap penalties penalize chance hits much more than true positives
      – True positives tend to have insertions or deletions at the same positions as the sequences from which the HMM was built.
Introduction
• Pictorial representation of a profile HMM with M, I and D states.
Theory
• Align 2 HMMs by maximizing a score
  – The score is the log-sum-of-odds score.
• What does a path through the 2 HMMs represent?
  – A sequence co-emitted by both HMMs
• How do we find this path?
  – Use dynamic programming (Viterbi)
  – Find the path that maximizes the log-sum-of-odds score
Theory
• Advantages of HMM-HMM
  – Improves both sensitivity and alignment quality
• Calibrate the score for additional sensitivity
  – Use a scoring correlation function
  – Use secondary structure information
    • Even sequences that are distantly homologous will have similar secondary structures.
    • This can help distinguish real homologs from chance hits
    • Biologically, secondary structures diverge more slowly than sequences
    • This knowledge is utilized.
Theory
• Additionally, to enhance homology detection
  – Score secondary structure
  – Use other available information (like confidence values, a term covered later in the slides)
• Tool HHPred
  – Homology detection & structure prediction
  – Novelty
    • HMM-HMM comparison
    • Scores secondary structure
    • Reliability measured by
      – Probability of each match being a true positive
      – Used since E-values reported by most tools can be inaccurate
Theory (Log-sum-of-odds score)
• Defined as

$$S_{LSO} = \log \sum_{x_1,\ldots,x_L} \frac{P(x_1,\ldots,x_L \mid \text{emission on path})}{P(x_1,\ldots,x_L \mid \text{Null})}$$

• Numerator
  – Probability that x1,…,xL is co-emitted by both HMMs along the alignment path
• Denominator
  – Standard null model probability
• Summation
  – Runs over all sequences of L residues that can be emitted along the alignment path by both HMMs
Theory
• How do we apply the Viterbi algorithm?
• Denote
  – 2 HMMs p and q
  – Probability of emitting amino acid a in match state i of q or j of p is qi(a) and pj(a)
  – Transition probabilities = qi(X, X’) and pj(Y, Y’)
  – X and Y belong to {M, I, D}
  – f(a) = fixed background frequency
  – Let Xk and Yk be the states in q and p in the k’th column of the alignment of q with p.
  – Let i(k) and j(k) be the corresponding columns from q and p.
  – $q^P_k(a)$ and $p^P_k(a)$ = emission probabilities from q and p in alignment column k.
Theory
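• $P_{tr}$ is the product of all transition probabilities for the path through p and q
• $q^P_k(a) = q_{i(k)}(a)$ for $X_k = M$
• $q^P_k(a) = f(a)$ for $X_k = I$

$$
\begin{aligned}
S_{LSO} &= \log \left[ \sum_{x_1,\ldots,x_L} \prod_{k=1}^{L} \frac{q^P_k(x_k)\, p^P_k(x_k)}{f(x_k)} \times P_{tr} \right] \\
&= \log \left[ \sum_{x_1} \cdots \sum_{x_L} \prod_{k=1}^{L} \frac{q^P_k(x_k)\, p^P_k(x_k)}{f(x_k)} \times P_{tr} \right] \\
&= \log \prod_{k=1}^{L} \left( \sum_{a=1}^{20} \frac{q^P_k(a)\, p^P_k(a)}{f(a)} \right) + \log P_{tr} \\
&= \sum_{k:\, X_k Y_k = MM} S_{aa}(q_{i(k)}, p_{j(k)}) + \log P_{tr}
\end{aligned}
$$

$$S_{aa}(q_i, p_j) = \log \sum_{a=1}^{20} \frac{q_i(a)\, p_j(a)}{f(a)} \quad \text{(column score)}$$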
Theory
• Column score properties:
  – Positive when the 2 distributions are similar
  – Negative otherwise
  – Insert states have vanishing column score
    • Completely non-conserved, pj(a) = f(a)
  – 1/f(a)
    • Weight factor on the co-emission probability
    • For a rare amino acid
      – f(a) will be low, so 1/f(a) will be high
      – The weight of rarer amino acids in the score calculation increases compared to common amino acids.
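The column score properties listed here can be checked numerically. The sketch below assumes the column score is log Σ_a q_i(a) p_j(a) / f(a), and uses a uniform background and hand-made toy columns; both are illustrative assumptions, not values from the paper.

```python
import math

def column_score(q_col, p_col, background):
    """Log of the background-weighted co-emission probability of two columns."""
    return math.log(sum(q * p / f for q, p, f in zip(q_col, p_col, background)))

def peaked_column(index, size=20, peak=0.81):
    """Toy column: most of the probability mass on one amino acid."""
    rest = (1.0 - peak) / (size - 1)
    return [peak if k == index else rest for k in range(size)]

background = [0.05] * 20  # uniform background, an assumption for the demo

same = column_score(peaked_column(0), peaked_column(0), background)
different = column_score(peaked_column(0), peaked_column(1), background)
unconserved = column_score(peaked_column(0), background, background)
```

Two columns conserved for the same residue score positive, columns conserved for different residues score negative, and a column equal to the background (the insert-state case) scores zero up to rounding, because the sum collapses to Σ_a q(a) = 1.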
Theory
• Pair-wise alignment of HMMs
• Allowed transitions
• Dynamic programming matrices for Viterbi
Theory
• We use 5 DP matrices S_XY, one for each pair state XY belonging to {MM, MI, IM, DG, GD}
• S_MM(i,j) = S_aa(q_i, p_j) + max {
    S_MM(i-1,j-1) + log[q_{i-1}(M,M) p_{j-1}(M,M)],
    S_MI(i-1,j-1) + log[q_{i-1}(M,M) p_{j-1}(I,M)],
    S_IM(i-1,j-1) + log[q_{i-1}(I,M) p_{j-1}(M,M)],
    S_DG(i-1,j-1) + log[q_{i-1}(D,M) p_{j-1}(M,M)],
    S_GD(i-1,j-1) + log[q_{i-1}(M,M) p_{j-1}(D,M)] }
Theory
• S_MI(i,j) = max {
    S_MM(i-1,j) + log[q_{i-1}(M,M) p_j(M,I)],
    S_MI(i-1,j) + log[q_{i-1}(M,M) p_j(I,I)] }
• S_DG(i,j) = max {
    S_MM(i-1,j) + log[q_{i-1}(M,D)],
    S_DG(i-1,j) + log[q_{i-1}(D,D)] }
• Initialize S_MM(i,0) = 0 = S_MM(0,j)
• S_LSO = max over the last row and last column of S_MM
• Trace back from this cell.
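The recursion on these two slides can be sketched in a few lines. This is an illustrative toy, not the HHPred implementation: the dictionary layout of an HMM, the helper `toy_hmm`, the 4-letter alphabet and the transition values are all assumptions. The IM and GD updates are not shown on the slides and are filled in here by symmetry with MI and DG. For the MI update, the transition of p is taken from column j (the column p stays in), which is an assumption about the intended indexing. Traceback is omitted; only the score is returned.

```python
import math

NEG_INF = float("-inf")

def log_t(hmm, col, src, dst):
    """Log transition probability src->dst leaving column col; -inf if absent."""
    prob = hmm["trans"][col].get((src, dst), 0.0)
    return math.log(prob) if prob > 0 else NEG_INF

def column_score(q_col, p_col, background):
    return math.log(sum(x * y / f for x, y, f in zip(q_col, p_col, background)))

def hmm_hmm_viterbi(q, p, background):
    """Fill the five pair-state matrices and return the best S_MM border score."""
    n, m = len(q["emit"]) - 1, len(p["emit"]) - 1  # index 0 is the begin column
    S = {xy: [[NEG_INF] * (m + 1) for _ in range(n + 1)]
         for xy in ("MM", "MI", "IM", "DG", "GD")}
    for i in range(n + 1):
        S["MM"][i][0] = 0.0
    for j in range(m + 1):
        S["MM"][0][j] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S["MM"][i][j] = column_score(q["emit"][i], p["emit"][j], background) + max(
                S["MM"][i-1][j-1] + log_t(q, i-1, "M", "M") + log_t(p, j-1, "M", "M"),
                S["MI"][i-1][j-1] + log_t(q, i-1, "M", "M") + log_t(p, j-1, "I", "M"),
                S["IM"][i-1][j-1] + log_t(q, i-1, "I", "M") + log_t(p, j-1, "M", "M"),
                S["DG"][i-1][j-1] + log_t(q, i-1, "D", "M") + log_t(p, j-1, "M", "M"),
                S["GD"][i-1][j-1] + log_t(q, i-1, "M", "M") + log_t(p, j-1, "D", "M"))
            S["MI"][i][j] = max(
                S["MM"][i-1][j] + log_t(q, i-1, "M", "M") + log_t(p, j, "M", "I"),
                S["MI"][i-1][j] + log_t(q, i-1, "M", "M") + log_t(p, j, "I", "I"))
            S["IM"][i][j] = max(  # by symmetry with MI (assumption)
                S["MM"][i][j-1] + log_t(q, i, "M", "I") + log_t(p, j-1, "M", "M"),
                S["IM"][i][j-1] + log_t(q, i, "I", "I") + log_t(p, j-1, "M", "M"))
            S["DG"][i][j] = max(
                S["MM"][i-1][j] + log_t(q, i-1, "M", "D"),
                S["DG"][i-1][j] + log_t(q, i-1, "D", "D"))
            S["GD"][i][j] = max(  # by symmetry with DG (assumption)
                S["MM"][i][j-1] + log_t(p, j-1, "M", "D"),
                S["GD"][i][j-1] + log_t(p, j-1, "D", "D"))
    last_row = S["MM"][n][1:]
    last_col = [S["MM"][i][m] for i in range(1, n + 1)]
    return max(last_row + last_col)

def toy_hmm(columns):
    """Hypothetical helper: emission columns plus identical toy transitions."""
    trans = {("M", "M"): 0.9, ("M", "I"): 0.05, ("M", "D"): 0.05,
             ("I", "M"): 0.5, ("I", "I"): 0.5, ("D", "M"): 0.5, ("D", "D"): 0.5}
    return {"emit": [None] + columns, "trans": [trans] * (len(columns) + 1)}

def peak(k):
    col = [0.1] * 4
    col[k] = 0.7
    return col

background = [0.25] * 4
q = toy_hmm([peak(0), peak(1), peak(2)])
p = toy_hmm([peak(0), peak(1), peak(2)])
r = toy_hmm([peak(3), peak(3), peak(3)])
score_qp = hmm_hmm_viterbi(q, p, background)
score_qr = hmm_hmm_viterbi(q, r, background)
```

With these toy inputs the two identical HMMs align along the diagonal with a positive score, while the unrelated pair scores negative, since every column score and every log transition probability on its path is negative.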
Theory
• Scoring correlations
  – Clustering
    • In an alignment of 2 homologous HMMs
      – Expect high column scores to occur in clusters along the sequence
    • In an alignment of non-homologous HMMs
      – Do not expect any clustering.
  – The above can help
    • Differentiate homologous and non-homologous alignments
  – If the l’th pair state of the optimum path aligns columns i(l) of q and j(l) of p
    • S_l = S_aa(q_i(l), p_j(l)) if the l’th pair state is MM, else S_l = 0.
  – Auto-correlation function

$$g(d) = \sum_{l=1}^{L-d} S_l\, S_{l+d}$$
Theory
• Scoring correlations
  – The auto-correlation function describes the correlation of S_l at a fixed sequence separation d
  – Expect
    • If the 2 HMMs are homologous
      – A positive g(d) for small d.
  – Add a correction factor

$$S_{corr} = w_{corr} \sum_{d=1}^{4} g(d)$$

  – w_corr is found empirically to be 0.1
  – The correction factor is added after the best alignment is found.
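A small sketch of the correlation correction, assuming g(d) is the sum of S_l · S_{l+d} over the alignment and S_corr = w_corr · Σ_{d=1..4} g(d) with w_corr = 0.1 as stated on the slide. The two score lists are made-up toy data chosen to contrast clustered and scattered column scores.

```python
def autocorrelation(scores, d):
    """g(d): sum of products of column scores that are d positions apart."""
    return sum(scores[l] * scores[l + d] for l in range(len(scores) - d))

def correlation_score(scores, w_corr=0.1, max_d=4):
    """S_corr: weighted sum of g(d) for d = 1..max_d."""
    return w_corr * sum(autocorrelation(scores, d) for d in range(1, max_d + 1))

# Same four positive column scores, clustered versus spread out (toy data).
clustered = [1.0, 1.0, 1.0, 1.0] + [0.0] * 12
scattered = [1.0 if l % 5 == 0 else 0.0 for l in range(16)]

corr_clustered = correlation_score(clustered)
corr_scattered = correlation_score(scattered)
```

Both toy alignments have the same total column score, but only the clustered one receives a positive correction, which is the behaviour the slide expects for homologous pairs.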
Theory
• Scoring secondary structure
  – Allows a predicted secondary structure to be scored against
    • Another predicted secondary structure
    • Or a known secondary structure
• Predicted secondary structure vs. known secondary structure.
  – DSSP is used to assign 1 of 7 states of observed secondary structure
  – PSIPRED is used to predict secondary structure states H, E or C.
  – Predict the secondary structure of all domains in SCOP (filtered to the twilight zone)
  – Compare the PSIPRED predictions with DSSP
  – Count the occurrences of each combination (σ;ρ,c).
    • σ belongs to {H,E,C,G,B,S,T}
    • ρ belongs to {H,E,C}
    • c belongs to {0,1,…,9}
Theory
• Scoring secondary structure
  – Derive ten 3×7 substitution matrices (one for each confidence value)

$$M_{ss}(\sigma;\rho,c) = \log \frac{P(\sigma;\rho,c)}{P(\sigma)\,P(\rho,c)}$$

• Let
  – Column i of HMM q have predicted secondary structure $\rho_i^q$ and confidence value $c_i^q$
  – Column j of HMM p have known secondary structure $\sigma_j^p$ (Note: the known secondary structure is that of the seed sequence of the alignment)
  – Define
    • $S_{ss}(q_i, p_j) = w_{ss}\, M_{ss}(\sigma_j^p; \rho_i^q, c_i^q)$
    • Empirically, $w_{ss}$ is 1/7.
    • This score is added to the amino acid column score S_aa(q_i, p_j) for use in the Viterbi algorithm.
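The predicted-versus-known substitution scores can be estimated directly from counts. The sketch below assumes the input is a list of (σ, ρ, c) observations, one per residue, and computes log[P(σ;ρ,c) / (P(σ) P(ρ,c))] for every observed combination. The function name and the toy observations are hypothetical; real counts would come from comparing PSIPRED output with DSSP assignments.

```python
import math
from collections import Counter

def ss_substitution_scores(observations):
    """Map each observed (sigma, rho, c) triple to its log-odds score."""
    joint = Counter(observations)
    total = sum(joint.values())
    known, predicted = Counter(), Counter()
    for (sigma, rho, c), n in joint.items():
        known[sigma] += n            # marginal counts of the known state
        predicted[(rho, c)] += n     # marginal counts of (prediction, confidence)
    return {
        (sigma, rho, c): math.log(
            (n / total) / ((known[sigma] / total) * (predicted[(rho, c)] / total)))
        for (sigma, rho, c), n in joint.items()
    }

# Toy counts (hypothetical): confident predictions are right 80% of the time.
observations = ([("H", "H", 9)] * 8 + [("E", "H", 9)] * 2 +
                [("E", "E", 9)] * 8 + [("H", "E", 9)] * 2)
m_ss = ss_substitution_scores(observations)
w_ss = 1 / 7  # weight quoted on the slide
s_ss = w_ss * m_ss[("H", "H", 9)]
```

Agreeing states get a positive score and disagreeing states a negative one; multiplying by w_ss gives the term that is added to the amino acid column score.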
Theory
• Scoring secondary structure (predicted vs predicted)

$$M_{ss}(\rho_i^q, c_i^q; \rho_j^p, c_j^p) = \log \frac{\sum_{\sigma} P(\rho_i^q, c_i^q \mid \sigma)\, P(\rho_j^p, c_j^p \mid \sigma)\, P(\sigma)}{P(\rho_i^q, c_i^q)\, P(\rho_j^p, c_j^p)}$$

• The above matrix tells us
  – How much more probable it is to get the predictions $(\rho_i^q, c_i^q)$ and $(\rho_j^p, c_j^p)$ for a pair of aligned homologous residues than to get them independently of each other.
• $S_{ss}(q_i, p_j) = w_{ss}\, M_{ss}(\rho_i^q, c_i^q; \rho_j^p, c_j^p)$
  – Empirically, $w_{ss}$ is 1/7.
  – This score is added to the amino acid column score S_aa(q_i, p_j) for use in the Viterbi algorithm.
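For the predicted-versus-predicted case, a sketch under the assumption that the score sums over the unobserved true state σ: log[Σ_σ P(a|σ) P(b|σ) P(σ) / (P(a) P(b))], where a and b are the two (ρ, c) predictions. The conditional probabilities and the prior below are toy values, and `pred_vs_pred_score` is a hypothetical name.

```python
import math

def pred_vs_pred_score(cond, prior, a, b):
    """cond[sigma][(rho, c)] = P(rho, c | sigma); prior[sigma] = P(sigma)."""
    joint = sum(cond[s].get(a, 0.0) * cond[s].get(b, 0.0) * prior[s] for s in prior)
    marg_a = sum(cond[s].get(a, 0.0) * prior[s] for s in prior)
    marg_b = sum(cond[s].get(b, 0.0) * prior[s] for s in prior)
    return math.log(joint / (marg_a * marg_b))

# Toy model (assumed): two true states, predictions correct 80% of the time.
prior = {"H": 0.5, "E": 0.5}
cond = {
    "H": {("H", 9): 0.8, ("E", 9): 0.2},
    "E": {("H", 9): 0.2, ("E", 9): 0.8},
}
agree = pred_vs_pred_score(cond, prior, ("H", 9), ("H", 9))
disagree = pred_vs_pred_score(cond, prior, ("H", 9), ("E", 9))
```

Two matching predictions are more likely for aligned homologous residues than by chance, so they score positive; conflicting predictions score negative.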
Results and Discussion
• All-against-all comparison with the following similarity search tools:
  – Sequence-Sequence
    • BLAST
  – Profile-Sequence
    • PSI-BLAST
  – HMM-Sequence
    • HMMER
  – Profile-Profile
    • COMPASS
    • PROF_SIM
• Test
  – Input below the twilight zone
  – Ability to detect remote homologs
  – Ability to give high-quality alignments.
Results and Discussion
• Different versions of the tool are used for better comparison of results
  – HHSearch 0
    • Simple profile-profile comparison
    • Gap opening penalty = -3.5, gap extension penalty = -0.2
    • These are used instead of the log transition probabilities
  – HHSearch 1
    • Basic HMM-HMM version
  – HHSearch 2
    • Version 1 + inclusion of correlation score
  – HHSearch 3
    • Version 2 + usage of predicted vs predicted secondary structure
  – HHSearch 4
    • Version 3 + usage of predicted vs known secondary structure
Results and Discussion
• The SCOP (Structural Classification of Proteins) database, filtered to the twilight zone, is used.
• Detection of homologs:
  – A domain in SCOP is assigned a
    • Family, superfamily, fold and class
  – A pair of domains is homologous
    • If they are members of the same superfamily
  – Domains from different classes are classified as non-homologous
• We present a chart of TP vs FP
  – TP: homologous pairs
  – FP: non-homologous pairs.
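The TP/FP definition above can be written as a small classifier. The sketch assumes each domain is labelled with a SCOP-style identifier of the form class.fold.superfamily.family (for example "a.1.1.2"). The "undecided" label for pairs that are neither in the same superfamily nor in different classes is an assumption; the slide does not say how such pairs are treated.

```python
def classify_pair(sccs_a, sccs_b):
    """Label a domain pair using class.fold.superfamily.family identifiers."""
    level_a = sccs_a.split(".")
    level_b = sccs_b.split(".")
    if level_a[:3] == level_b[:3]:
        return "TP"         # same superfamily: homologous
    if level_a[0] != level_b[0]:
        return "FP"         # different classes: non-homologous
    return "undecided"      # same class, different superfamily (assumption)

labels = [
    classify_pair("a.1.1.2", "a.1.1.3"),
    classify_pair("a.1.1.2", "b.1.1.2"),
    classify_pair("a.1.1.2", "a.2.1.1"),
]
```

Counting "TP" and "FP" labels over the hits of a search, ranked by score, gives the points of the TP vs FP chart.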
Results and Discussion
• The figure shows classical sensitivity in the benchmark test.
Results and Discussion
• Alternative definition of TP and FP
  – A pair is a TP
    • If the domains belong to the same SCOP superfamily
    • Or if the sequence-based alignment gives a structural alignment with a MaxSub score of at least 0.1
  – A pair is a FP
    • If it is from different classes and has a MaxSub score of 0
  – What is the MaxSub score?
    • Informally
      – Defined such that a value > 0 occurs very rarely by chance
      – It tells what fraction of the query residues can be superposed structurally with the aligned residues from the other structure.
    • Formally
      – Weighted number of aligned pairs that can be superimposed with a maximum distance per pair of 3.5 Ångström, divided by the number of residues in the query sequence
      – Pairs with 0 Ångström deviation have weight 1
      – Pairs with 3.5 Ångström deviation have weight 0.5
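To make the weighting concrete, the sketch below uses the weight 1 / (1 + (d / 3.5)²), which reproduces the two values on the slide (1 at 0 Å, 0.5 at 3.5 Å). Treat the exact functional form as an assumption here. The search for the best superposition, which the real MaxSub procedure performs, is left out: the function only scores a given list of per-pair distances.

```python
def maxsub_weight(distance, threshold=3.5):
    """Weight of one aligned pair: 1 at 0 Å, 0.5 at the threshold distance."""
    return 1.0 / (1.0 + (distance / threshold) ** 2)

def maxsub_like_score(distances, query_length, threshold=3.5):
    """Weighted count of pairs within the threshold, per query residue."""
    total = sum(maxsub_weight(d, threshold) for d in distances if d <= threshold)
    return total / query_length

# Toy example: three aligned pairs in a 4-residue query, one too far apart.
toy_score = maxsub_like_score([0.0, 3.5, 10.0], query_length=4)
```

Here the pair at 10 Å contributes nothing, so the toy score is (1 + 0.5) / 4.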
Results and Discussion
• Plot of TP vs FP with new definition of TP and FP
Results and Discussion
• Observation
  – The more sensitive tools, which use secondary structure (HHSearch 3, 4), improve under the new definition
  – Reason
    • Reclassifying the “harder to detect” pairs as TP helps the more sensitive tools, since they are the ones that detect these pairs.
Results and Discussion (Alignment quality)
• Sequence alignment assessed by
  – Looking at the spatial distances between aligned pairs of residues
    • upon superposition of the 3D structures
• 2 scores used.
• MaxSub score
  – Drawback
    • Does not penalize over-prediction
• Developer’s score
  – S_dev = N_correct / min(Lq, Lp)
  – N_correct = number of residue pairs that are present in the maximum subset identified by MaxSub
  – Lq and Lp = number of residues in the 2 sequences to be aligned.
• Modeler’s score
  – S_mod = N_correct / L_ali
  – L_ali = number of aligned residue pairs in the sequence alignment.
  – Does not penalize under-prediction.
• Balanced score
  – S_balanced = (S_dev + S_mod) / 2
  – Penalizes both under- and over-prediction
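The three alignment-quality scores are simple ratios, sketched here with made-up numbers. The function name and the example values are illustrative only.

```python
def alignment_quality(n_correct, len_q, len_p, n_aligned):
    """Return (developer's, modeler's, balanced) scores for one alignment."""
    s_dev = n_correct / min(len_q, len_p)   # rewards long correct alignments
    s_mod = n_correct / n_aligned           # rewards precise alignments
    return s_dev, s_mod, (s_dev + s_mod) / 2

# Toy case: 30 correct pairs out of 60 aligned, sequences of 100 and 120 residues.
s_dev, s_mod, s_balanced = alignment_quality(30, 100, 120, 60)
```

Aligning more residues than can be superposed lowers S_mod but not S_dev, and aligning too few lowers S_dev but not S_mod, which is why the average of the two penalizes both.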
Results and Discussion
• HHSearch 3 performs the best
  – Family level
    • Aligns 58% of all pairs with balanced score >= 0.3
    • 1.23 times more than COMPASS
    • 1.28 times more than PROF_SIM
    • 1.34 times more than HMMER
    • 1.57 times more than PSI-BLAST
    • 4.4 times more than BLAST
  – Superfamily level
    • Aligns 27% of all pairs with balanced score >= 0.3
    • 1.7 times more than COMPASS
    • 1.9 times more than PROF_SIM
    • 2.2 times more than HMMER
    • 2.9 times more than PSI-BLAST
    • 14 times more than BLAST
Results and Discussion
• HHSearch 3 performs the best
  – Fold level
    • Aligns 4.5% of all pairs with balanced score >= 0.3
    • 3.3 times more than COMPASS
    • 6.0 times more than PROF_SIM
    • 7.3 times more than HMMER
    • 9.4 times more than PSI-BLAST
    • 63 times more than BLAST
  – 4.5% at the fold level is actually a lot
    • Pairs related only at the fold level are deemed non-homologous by SCOP
    • So we would not expect any good alignments at all
Conclusion
• A generalization of HMM-sequence alignment
  – Pairwise alignment of profile HMMs
• Algorithm to maximize the log-sum-of-odds score
  – Generalization of the log-odds score
• Increased sensitivity of 5-10%
  – Due to the novel correlation score
• Statistical methods for
  – Scoring predicted vs known secondary structure
  – Scoring predicted vs predicted secondary structure
  – Using confidence values of the secondary structure prediction
Conclusion
• HHPred
  – New tool based on the research paper
• Benchmarking
  – With 5 other homology detection tools
  – Dataset in the twilight zone
• Results
  – Improvement in
    • Sensitivity
    • Alignment quality
Thank you.
Have a nice day!