Protein homology detection by HMM–HMM comparison Johannes Söding

Protein homology detectionby HMM–HMM comparison

Johannes Söding

A topic in Sequence analysisPresented by:

Giriprasad [email protected]

CISC 841 Spring 2006APR 20 2006

Organization of presentation

• Introduction

• Theory

• Results

• Conclusion

Introduction

• Paper Details:– Bioinformatics journal– Vol. 21 no. 7 2005, pages 951–960

• Author Details– Dr. Johannes Söding

• Department of Protein Evolution,

MaxPlanckInstitute for Developmental Biology,

Spemannstrasse 35, D72076

Tübingen, Germany

Introduction

• Tool Details:– A tool HHPred has been developed.– Described in Nucleic Acid Research, 2005,

Vol 33– A web server is available at

http://www.protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred

Introduction

• Central theme in bioinformatics:– Homology and sequence alignment

• Issues:– Problem of finding a close homolog with known

function or structure which would allow to make inferences about the protein under observation.

– New and highly sensitive methods could detect and align remotely homologous sequences that provide information about the protein’s function, structure or evolution.

Introduction

• Methods (Tools) of homology detection:(In increasing order of sensitivity)– Sequence - Sequence

• BLAST• FASTA

– Profile - Sequence• PSIBLAST

– More sensitive since it uses a sequence profile– Profile – Profile

• COMPASS• PROF_SIM

– Profile - HMM• HMMER

– HMM-HMM• HHPred

Introduction

• Sequence profiles– Built from a multiple alignment of homologous sequences– Contains more information about the sequence family than a

single sequence.– Helps to distinguish between

• conserved and non-conserved positions • Conserved are important for defining members of the family • Non-conserved are variable among the members of the family.

– Describe exactly • what variation in amino acids is possible at each position • Done by recording the probability for the occurrence of each amino acid along the multiple alignment.

Introduction

• Profile Hidden Markov Models (Profile HMMs)– Similar to simple sequence profiles– have amino acid frequencies as in the columns of a MSA– Also have position specific probabilities for inserts and deletions

along the alignment– logarithms of these probabilities =position specific gap penalties– Perform better than sequence profiles in the detection of

homologs and in the quality of alignments– Why higher sensitivity?

• Position specific gap penalties penalize chance hits much more than true positives

– which tend to have insertions or deletions at the same positions as the sequences from which the HMM was built.

Introduction

• Pictorial representation of profile HMM

With M, I and D states.

Theory

• Align 2 HMM by maximizing a score– Score is log-sum-of-odds score.

• What does a path through the 2 HMMs Represent?

• A sequence co-emitted by both HMMs

• How do we find this path?– Use dynamic programing (Viterbi) – Find path that maximizes log-sum-of-odds

score

Theory

• Advantages of HMM-HMM– Improves both sensitivity and alignment quality

• Calibrate the score for additional sensitivity– Use scoring correlation function– Use secondary structure information

• Even sequences that are distantly homologous will have similar secondary structures.

• This can help distinguish real homologs from chance hits• Biologically, secondary structures diverge more slowly than

sequences• This knowledge is utilized.

Theory

Theory

• Additionally to enhance homology detection– Score secondary structure– Use other available additional information (like

confidence – term covered later on in the slides)

• Tool HHPred– Homology detection & structure prediction– Novelty

• HMM-HMM comparison• Scores secondary structure• Reliability measured by

– Probability of each match being a true positive– Used since e-values reported by most tools can be inaccurate

Theory (Log-sum-of-odds score)

• Defined as

• Numerator– probability that x1,…xL is co-emitted by both HMMs

along the alignment path

• Denominator – Standard null model probability

• Summation– Runs over all sequences of L residues that can be emitted

along the alignment path by both HMMs

Lxx

NULL))|x,...,P(xpath)on emission | x..., ,(x (P S 1,..

L1L1LSO log

Theory

• How do we apply Viterbi algorithm?• Denote

– 2 HMMs p and q– Probability of emitting amino acid a in match state i or j is qi(a)

and pj(a)– Trans prob = qi(X, X’) and pj(Y,Y’)– X or Y can belong to {M, I or D}– f(a) = fixed background frequency– Let Xk and Yk be states in q and p in the k’th column of the

alignment of q with p.– i(k) and j(k) be the corresponding columns from q and p.

– qk(l)P (a) and pk(l)

P(a) = emission prob from q and p.

Theory

• Ρ tr is the product of all transition probabilities for the path through p and q

• qk(l)P (a) = qi(k) (a) for Xk = M

• qk(l)P (a) = f (a) for Xk = I

20

1

)()(

:

1

20

1)()(

20

1 1

/)()(

20

11

,..1 1)(

1)(

)(/)()(log),(

log),(

log))(/)()((log

)(/)()(...log

))(/())()((log

a

jijiaa

trkjki

MMXk

aa

tr

L

l a

lP

lkl

P

lk

xL

L

l

ltrlP

lkl

P

lkx

xLx

Ll

l

ltrlP

lk

Ll

l

lP

lkLSO

afapaqeColumnScorpqS

PpqS

Pafxx

xfPxx

xfPxxS

kYk

pq

pq

pq

Theory• Column score properties:

– Positive when 2 distributions are similar– Negative otherwise– Insert states have vanishing column score

• Completely non-conserved, pj(a) = f(a)

– 1/f(a)• Weight factor to co-emission probability• For a rare amino acid

– f(a) will be low 1/f(a) will be high Weight of rarer amino acids increases in the score

calculation as compared to common amino acids.

Theory

• Pair-wise alignment of HMMs

• Allowed transitions

• Dynamic programing matrices for Viterbi

Theory

• We use 5 DP matrices S xy one for each pair state XY belonging to {MM, MI, IM, DG, GD}

• SMM (i, j) = Saa(qi,pj) + max {

SMM(i-1,j-1) + log[q i-1(M,M) p j-1(M,M)],

SMI(i-1,j-1) + log[q i-1(M,M) p j-1(I,M)]

SIM(i-1,j-1) + log[q i-1(I,M) p j-1(M,M)]

SDG(i-1,j-1) + log[q i-1(D,M) p j-1(M,M)]

SGD(i-1,j-1) + log[q i-1(M,M) p j-1(D,M)] }

Theory• SMI (i, j) = max {

SMM(i-1,j) + log[q i-1(M,M) p j-1(M,I)],

SMI(i-1,j) + log[q i-1(M,M) p j-1(I,I)] }

• SDG (i, j) = max {

SMM(i-1,j) + log[q i-1(M,D)],

SDG(i-1,j) + log[q i-1(D,D) }

• Initialize SMM(I,0) = 0 = SMM(0,j)

• S LSO = max over last row, col of S MM

• Trace back from this cell.

Theory• Scoring correlations

– Clustering• In an alignment of 2 homologous HMMs

– Expect high column scores in» Clusters along the sequence

• In an alignment of non-homologous HMMs– Do not Expect any clustering.

– The above can help• Differentiate homologous and non-homologous alignments

– If l’th pair state of optimum path aligns columns i(l) of q and j(l) of p

• Sl = SAA(qi(l), pj(l)) iff l’th pair state = MM, else 0.

– Auto-correlation function

dL

l

l dlSSdg1

)(

Theory• Scoring correlations

– Auto-correlation function describes correlation of Sl at a fixed sequence separation d

– Expect• if 2 HMMs are homologous

– A Positive g(d) for small d.

– Add a correction factor

– wcorr is found empirically to be 0.1

– The correction factor is added after the best alignment

is found.

)(4

1

dgwSd

corrcorr

Theory

• Scoring secondary structure– Allows to score predicted secondary structure against

• Another predicted secondary structure• Or a known secondary structure

• Predicted secondary structure vs. known secondary structure.– DSSP used to assign 1 of 7 states of observed secondary

structure

– PSIPRED used to predict secondary structure states, H, E or C.

– Predict secondary structure of all domains in SCOP (filtered to twilight zone)

– Compare the PSIPRED predictions with DSSP

– Get the count of combination of (σ;ρ,c).• σ belongs to {H,E,C,G,B,S,T}• ρ belongs to {H,E,C}• c belongs to {0,1,…,9}

Theory

• Scoring secondary structure

– Derive 10 3*7 substitution matrices (one for

each confidence value)

Mss(σ;ρ,c) = log (P (σ;ρ,c)/P(σ)P(ρ,c))

• Let – Column i of HMM q have pred sec struct ρi

q and confidence value ciq

– Column j of HMM p have known sec struct σjp (Note: known sec

struct secondary structure of seed seq of alignment)

– Define

• Sss(q I p j) = wss Mss(σjp;ρi

q ciq)

• Empirically Wss is 1/7.

• This score is added to amino acid column score Saa(qi, pj) for use in

the Viterbi algorithm.

Theory

• Scoring secondary structure (predicted vs predicted)

• The above matrix informs– How much more probable is it to get the

predictions ρiq ci

q and ρjp cj

p for a pair of aligned homologous residues than to get them independently of each other.

• Sss(q I p j) = wss Mss(ρiq ci

q ρjp cj

p)

– Empirically Wss is 1/7.

– This score is added to amino acid column score Saa(qi, pj)

for use in the Viterbi algorithm.

),(),(/)()|,()|,(log),,,( p

j

p

j

q

i

q

i

p

j

p

j

q

i

q

i

p

j

p

j

q

i

q

iss cPcPPcPcPccM

Results and Discussion

• All-against-All comparison with the following similarity search tools:– Sequence-Sequence

• BLAST

– Profile-Sequence• PSI-BLAST

– HMM-Sequence• HMMER

– Profile-Profile• COMPASS• PROF_SIM

• Test – Input below the twilight zone

– Ability to detect remote homologs

– Ability to give high-quality alignments.


• Different versions of tool used for better juxtaposition of results– HHSearch 0

• Simple profile-profile comparison • Gap opening penalty = -3.5, Gap Extension = -0.2• Above used instead of transition prob log

– HHSearch 1• Basic HMM-HMM version

– HHSearch 2• Version 1 + inclusion of correlation score

– HHSearch 3• Version 2 + usage of predicted vs predicted secondary

structure

– HHSearch 4• Version 3 + usage of predicted vs known secondary structure


• SCOP (structural classification of proteins) database with filtering for twilight zone used.

• Detection of homologs:– Domain in SCOP

• Family or superfamily or fold or class

– Pair of domains are homologous• If they are members of the same super family

• Domains from different classes are classified as non-homologous

• We present a chart of TP vs FP– TP homologous pairs– FP non-homologous pairs.


• The figure shows classical sensitivity in the benchmark test.


• Alternative definition of TP and FP– A pair is a TP

• If the domains belong to same SCOP super-family• Or if the seq based alignment gives structural alignment with a

“maxSub” score of at least 0.1– A pair is a FP

• If it is from different classes and has 0 MaxSub score– What is MaxSub score?

• Informally– Defined such that a value > 0 occurs very rarely by chance– It tells what fraction of the query residues can be superposed

structurally with the aligned residues from the other structure.• Formally

– Weighted number of aligned pairs that can be superimposed with a maximum distance per pair of 3.5 Angstrom units/number of residues in the query sequence

– Pairs with 0 Angstorm deviation wieght 1– Pairs with 3.5 Angstorm deviation wieght 0.5


• Plot of TP vs FP with new definition of TP and FP


• Observation– More sensitive tools which use secondary

structure (HHSearch 3, 4) improve – Reason

• Reclassification of “harder to detect” ones as TP helps the more sensitive tools, since they would detect these.

Results and Discussion(Alignment quality)

• Sequence alignment assessed by– Looking at the spatial distances between aligned pair of residues

• upon superposition of the 3D structures

• 2 scores used.• maxSub score

– Drawback• Does not penalize over-prediction

• Developer’s score– S Dev = N correct/min (Lq, Lp)– N Correct = No of residue pairs that are present in the max subset identified by

maxSub– Lq and Lp = No of residues in the 2 sequences to be aligned.

• Modeler’s score– S Mod = N correct / L ali– L Ali = No of aligned residue pairs in the seq alignment.– Does not penalize under-prediction.

• Balanced score– S balanced = (S dev + S mod) / 2– Penalizes both under and over prediction



• HHSearch3 performs the best– Family level

• Aligns 58% of all pairs with balanced score >= 0.3• 1.23 times more than COMPASS• 1.28 times more than PROF_SIM• 1.34 times more than HMMER• 1.57 times more than PSI_BLAST• 4.4 times more than BLAST

– Super family level • Aligns 27% of all pairs with balanced score >= 0.3• 1.7 times more than COMPASS• 1.9 times more than PROF_SIM• 2.2 times more than HMMER• 2.9 times more than PSI_BLAST• 14 times more than BLAST


• HHSearch3 performs the best– Fold level

• Aligns 4.5% of all pairs with balanced score >= 0.3• 3.3 times more than COMPASS• 6.0 times more than PROF_SIM• 7.3 times more than HMMER• 9.4 times more than PSI_BLAST• 63 times more than BLAST

– Actually 4.5% at fold level is a lot– Pairs aligned at fold level are deemed non-

homologous by SCOP– So we do not expect any good alignments at all

Conclusion

• A generalization of HMM – Sequence alignment– Pairwise alignment of profile HMMs

• Algorithm to maximize log-sum-of-odds score– Generalization of log-odds score

• Increased sensitivity of 5-10%– Due to derivation of novel correlation score

• Statistical methods for– Scoring predicted vs known secondary structure– Predicted vs predicted secondary structure– Uses confidence values of secondary structure

prediction

Conclusion

• HHPred– New tool based on the research paper

• Benchmarking– With 5 other homology detection tools– Dataset in twilight zone

• Results– Improvement in

• Sensitivity• Alignment quality

Thank you.

Have a nice day!

Documents

Protein homology detection by HMM–HMM comparison Johannes Söding