T coffee algorithm dissection

T-Coffee: A Method for Fast and Accurate Multiple Sequence Alignment

Chen, Gui 03/18/2015

Background & Motivation

Algorithm Illustration

Validation & Result


Why Do We Need Multiple Sequence Alignment?

Homology Modeling

Phylogenetic reconstruction

Illustrate conserved and variable sites within a family

Can be used to construct a profile or HMM to scour databases for distantly related members of the family

When constructing an MSA, in theory we should consider the evolutionary and structural relationships within the family. However…

1. Specific expert knowledge (even when it is available) is hard to integrate into an algorithm

2. General empirical models of protein evolution don't work well when sequences are less than 30% identical

3. Mathematically sound methods are prohibitively demanding in computer resources

That is why we introduce heuristic methods.

A Brief Review of Previous Methods to Construct MSA

ClustalW: Heuristic method

dynamic programming to build the guide tree, then progressive alignment

comment: fast, but suffers from its greediness (once a gap, always a gap); uses no local alignment information and takes little reference from the other sequences in the set while constructing the MSA

MSA & DCA: Mathematically sound

simultaneous alignment of all the sequences (Carrillo and Lipman algorithm, multidimensional dynamic programming)

comment: extremely CPU- and memory-intensive, and no better than other methods in terms of alignment performance; uses no local alignment information

Prrp & Muscle: Iterative method

heuristic method combined with an iterative method

comment: an interesting approach, but neither significantly faster nor more accurate than ClustalW; uses no local alignment information, but takes some reference from the other sequences in the set while constructing the MSA

Question: theoretically speaking, why is the iterative method not better than ClustalW? And why should local alignment information be taken into account?

A Brief Review of Previous Methods to Construct MSA

Dialign2: Heuristic method

simultaneous alignment of all sequences, but with a crude heuristic local alignment method

comment: fast, but considers only local alignment information and only to some extent takes reference from all sequences in the set; aligns poorly in practical use


…so the motivation now is to build a method with all the merits listed below:

Combine information from global alignment

Combine information from local alignment

Take reference from the other sequences in the sequence set during alignment

T-Coffee Method


Heuristic method with practical computational time

local alignment is more sensitive to domains and motifs

reference should be taken from other sequences when aligning conserved parts

the optimal alignment is not necessarily the biologically meaningful alignment

*A Case Where the ClustalW Method Fails




An Overview of the T-Coffee Method

…so how does T-Coffee satisfy all the merits mentioned earlier?

Requirement: Heuristic method with practical computational time
Solution: Progressive alignment (code from ClustalW)

Requirement: Combine information from global alignment
Solution: Primary library from global alignment (code from ClustalW)

Requirement: Combine information from local alignment
Solution: Primary library from local alignment (code from Lalign in the FASTA package)

Requirement: Take reference from other sequences in the sequence set during progressive alignment
Solution: Extension library built from the primary library (refer to the intermediate-sequence method)

Three major steps in FASTA:
1. build the hashing table
2. concatenate matched k-tuples
3. extend to get high-scoring segments
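As a rough illustration of the first two FASTA steps listed above, the sketch below builds a k-tuple hash table for one sequence and counts matched k-tuples per diagonal against another. The function names, k = 2 and the toy sequences are assumptions made for illustration; this is not the Lalign code, and the extension step (3) is left out.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Step 1: hash table from each k-tuple to its start positions in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=2):
    """Step 2 (simplified): count matched k-tuples on each diagonal.

    A diagonal is identified by (query position - target position);
    many hits on one diagonal suggest a segment worth extending.
    """
    index = ktuple_index(target, k)
    hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits[q - t] += 1
    return dict(hits)


# Toy sequences (hypothetical): the query occurs in the target at offset 3.
hits = diagonal_hits("GARFIELD", "THEGARFIELD", k=2)
print(hits)
```

All seven 2-tuples of the query fall on the same diagonal here, which is the kind of signal that would be handed to the extension step.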


Primary Library of Alignments (Global and Local)

Library: a set of pairwise alignments between all of the sequences to be aligned, stored as weight lists that are specific to each sequence pair and each position pair. A library can be stored as an N*N lower (or upper) triangular matrix in which the main diagonal can be ignored and each entry is a weight list. In other words, it is a list of weighted pairwise constraints. The primary library of global alignments for a sequence set is denoted by AG, and the primary library of local alignments is denoted by AL. A* refers to either AG or AL.

Now suppose we have a sequence set of size N (N is the number of sequences in the set). The total number of sequence pairs in the set is N*(N-1)/2.

We use Ai to denote the ith sequence (item) in the sequence set A, so entry A*ij of matrix A* holds the information from the pairwise alignment of Ai and Aj. Before generating the global library AG or the local library AL, we first do all possible pairwise alignments using the global alignment method or the FASTA local segment match method (Lalign).

*When we do local pairwise alignment, by default we keep the ten top-scoring non-intersecting local alignments from each sequence pair. The number of segments derived from a pair may well be fewer than 10 (simply because there are not that many qualifying matched segments) and could be 0.

After the pairwise alignment, we derive a list of pairwise residue matches for each entry of A*. Let Xm denote the mth position in a certain sequence Ai. The list in an entry can then be denoted by (Xn, Xm)|A*ij.

Finally, we assign a weight to each pairwise residue match in every list of every entry of A*. The weight equals the percentage identity of the alignment of Ai to Aj from which the residue match is derived: W[(Xn, Xm)|A*ij] = P.I.(A*ij). The weight is also referred to as a constraint.
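A minimal sketch of how one library entry could be derived from a single gapped pairwise alignment. The function names and the toy alignment are hypothetical, and percentage identity is assumed here to be counted only over columns where both sequences have a residue; the real implementation may normalize differently.

```python
def percent_identity(row_a, row_b):
    """Percent identity over columns where both rows hold a residue (assumption)."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    same = sum(1 for a, b in aligned if a == b)
    return 100.0 * same / len(aligned)


def residue_pairs(row_a, row_b):
    """Yield (m, n): 1-based residue positions that share an alignment column."""
    m = n = 0
    for a, b in zip(row_a, row_b):
        if a != "-":
            m += 1
        if b != "-":
            n += 1
        if a != "-" and b != "-":
            yield m, n


def primary_entry(row_a, row_b):
    """Weight list for one entry A*ij: every matched pair gets P.I.(A*ij)."""
    weight = percent_identity(row_a, row_b)
    return {pair: weight for pair in residue_pairs(row_a, row_b)}


# Toy gapped alignment of two sequences (hypothetical data).
entry = primary_entry("ACGT-A", "AGGTCA")
print(entry)
```

In this toy case four of the five matched columns are identical, so every recorded residue pair carries the weight 80.0, and the unmatched residue of the second sequence simply has no constraint.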

[Figure: the library drawn as an N*N triangular matrix with rows and columns A1 … AN; the entry for the pair (Ai, Aj) holds a list of W(Xm, Xn|A*ij).]

A library is a generalized (nested) list: it maps each sequence pair to a list (key-list pairs), and each of those lists maps a residue pair to its weight (key-value pairs).

[Figure: library entries produced for global alignment and for local alignment.]

Combination of the Libraries: Addition

The ClustalW and Lalign primary libraries are pooled by a simple process of addition:

AGL = AG + AL

W[(Xn, Xm)|AGLij] = W[(Xn, Xm)|AGij] + W[(Xn, Xm)|ALij]

If W[(Xn, Xm)|A*ij] is not recorded in A*, assign 0 to it. Entry AGLij can then be regarded as a 'sparse' list with L(i)*L(j) key-value pairs (many of the values are 0), where L(i) denotes the length (number of residues) of sequence Ai.
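The addition rule can be sketched with nested dictionaries, where a missing key plays the role of a weight of 0. The layout (sequence pair to residue pair to weight), the function name and the weights below are assumptions for illustration.

```python
def add_libraries(lib_g, lib_l):
    """Pool two libraries: weights recorded for the same residue pair are summed."""
    pooled = {}
    for lib in (lib_g, lib_l):
        for seq_pair, weights in lib.items():
            entry = pooled.setdefault(seq_pair, {})
            for res_pair, w in weights.items():
                # A pair absent from one library contributes 0 from that library.
                entry[res_pair] = entry.get(res_pair, 0) + w
    return pooled


# Hypothetical primary libraries for one sequence pair.
AG = {("A", "B"): {(1, 1): 88, (2, 2): 88}}
AL = {("A", "B"): {(2, 2): 95, (3, 4): 95}}
AGL = add_libraries(AG, AL)
print(AGL)
```

Only the residue pair (2, 2) is supported by both libraries, so it is the only one whose weight grows; storing just the nonzero weights keeps the 'sparse' list small.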

Library can be used as scoring scheme

Library A* can be regarded as a scoring scheme that is specific to each sequence pair and each position pair.

It can be regarded as a secondary scoring scheme, derived from dynamic programming pairwise alignments that use a substitution matrix as the primary scoring scheme.

Extending the library: Background

Purpose: to take reference from other sequences in each step of progressive alignment.

Previous solutions for this purpose: fitting a set of weighted constraints into a multiple alignment is a well-known problem, formulated by Kececioglu as an instance of the "maximum weight trace", an NP-complete problem. Two optimization strategies were proposed:

1. genetic algorithm: prohibitive computational time
2. graph-theoretical method: not robust enough for all cases

In short, this problem is not handled well from a graph-theory point of view.

Solution proposed by this paper: a heuristic algorithm inspired by the intermediate-sequence method. A triplet approach.

Extending the library: Triplet approach

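Triplet extension for the residue pair A(G), B(G), whose primary weight is W(A(G), B(G)) = 88:

consider seqC: W(A(G), C(?)) and W(B(G), C(?))
consider seqD: W(A(G), D(?)) and W(B(G), D(?))

If both weights point to the same residue of the intermediate sequence (C(?) == C(?)): take the min of the two weights (here W(A(G), C(?)) = 77, W(A(G), D(?)) = 100); else take 0.

E[W(A(G), B(G))] = W(A(G), B(G)) + sum of the triplet contributions = 88 + 77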

Sometimes we get a better alignment if we don't strictly follow the guide tree. That is why we take inference from other sequences when aligning two sequences along the guide tree. Iterative methods achieve this goal by modifying the guide tree in a heuristic manner, e.g. MUSCLE.

Extending the library: Let’s code this process

Denote the library extension operator by AE. Note that it is not a library that can be added to A*, because it is a function of A*: AE(A*) = A*E.

def AE(A*):
    A*E = copy of A*                              // read weights from A*, write results to A*E
    for i = 1 .. N:
        for j = i+1 .. N:                         // go through every entry A*ij: C(2,N)
            for m = 1 .. L(Ai):
                for n = 1 .. L(Aj):               // go through all constraints in the entry: L^2
                    E = 0
                    for each Ak in A - {Ai, Aj}:
                        a = get_position(m, i, k)
                        b = get_position(n, j, k)
                        for each position x found in both a and b:   // consistent residue in Ak supporting the match of Xm|Ai and Xn|Aj: 2L
                            e1 = W[(Xm, Xx)|A*ik]
                            e2 = W[(Xn, Xx)|A*jk]
                            E += min{e1, e2}      // extension weight
                    W[(Xm, Xn)|A*Eij] = W[(Xm, Xn)|A*ij] + E

def get_position(m, i, k):
    a = empty list
    for n = 1 .. L(Ak):
        if W[(Xm, Xn)|A*ik] != 0: add n to a      // positions in Ak matched to Xm|Ai: L(Ak)
    return a

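Cost of the loops as written: C(2,N) entries * L^2 position pairs * (N-2) intermediate sequences * 2L for the two get_position scans = O(N^3*L^3). Iterating only over the constraints actually recorded in each entry, rather than over all L^2 position pairs, lowers this.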
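A runnable sketch of the triplet extension, written over sparse dictionaries so that only recorded constraints are visited. The data layout, the function name and the toy weights are assumptions; library keys are assumed to follow the order of `names`, and the weights are chosen to mirror the 88 + 77 case from the triplet slide.

```python
def extend_library(lib, names):
    """Triplet extension. lib maps (si, sj) -> {(m, n): weight}."""
    # Symmetric lookup: links[(si, sj)][m] = {n: weight}, in both directions.
    links = {}
    for (si, sj), weights in lib.items():
        for (m, n), w in weights.items():
            links.setdefault((si, sj), {}).setdefault(m, {})[n] = w
            links.setdefault((sj, si), {}).setdefault(n, {})[m] = w

    # Start from a copy so that extension reads only primary weights.
    extended = {pair: dict(weights) for pair, weights in lib.items()}
    for a, si in enumerate(names):
        for sj in names[a + 1:]:
            entry = extended.setdefault((si, sj), {})
            for sk in names:
                if sk in (si, sj):
                    continue
                # Residue m of si matched to residue x of sk ...
                for m, via in links.get((si, sk), {}).items():
                    for x, w1 in via.items():
                        # ... and residue n of sj matched to the same x.
                        for n, w2 in links.get((sk, sj), {}).get(x, {}).items():
                            entry[(m, n)] = entry.get((m, n), 0) + min(w1, w2)
    return extended


names = ["A", "B", "C", "D"]
lib = {
    ("A", "B"): {(1, 1): 88},
    ("A", "C"): {(1, 1): 77},
    ("B", "C"): {(1, 1): 100},
}
ext = extend_library(lib, names)
# A-B gets 88 plus min(77, 100) through C; D has no match and adds nothing.
print(ext[("A", "B")][(1, 1)])
```

Writing into a separate copy matters: updating the library in place would let already-extended weights leak into later triplets.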

Extending the library: Let’s formulate this process

AGLE = AE(AGL) = AE(AG + AL)

Notice that the distributive law does not hold for the operator AE, because each triplet contributes the minimum of two weights. That is to say, in general: AGLE ≠ AE(AG) + AE(AL)
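A small numeric check of why the order of pooling and extension matters, reduced to a single triplet. The weights are hypothetical: the global library supports only the A-C match and the local library only the B-C match.

```python
def triplet(w_ik, w_jk):
    """Contribution of one intermediate residue: the weaker of the two links."""
    return min(w_ik, w_jk)


# (W[A-C], W[B-C]) in each primary library; 0 means "not recorded".
global_links = (77, 0)
local_links = (0, 100)

# Pool first, then extend: both links exist in the pooled library.
pooled_then_extended = triplet(global_links[0] + local_links[0],
                               global_links[1] + local_links[1])

# Extend each library separately, then add: each triplet is broken.
extended_then_pooled = triplet(*global_links) + triplet(*local_links)

print(pooled_then_extended, extended_then_pooled)
```

Pooling first lets a global link and a local link complete the same triplet, which separate extension can never do.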

Conclusion: Coffee Score Scheme

Given any pair of residues from any two sequences in the sequence set:

If weight = 0, that residue pair is never supported by the global, local or triplet-extension alignments (in other words, those residues may end up aligned against gaps).

If weight > 0, the weight reflects a combination of the similarity of the pair of sequences (global) or sequence segments (local) that the residue pair comes from, and the consistency of that match with residues from the other sequences in the set.

The weight library can then be used as the Coffee scoring scheme for progressive alignment.

*When the Coffee scoring scheme is applied in dynamic programming or progressive alignment, there is no need to set additional gap-opening or gap-extension penalties, for two reasons:

1. The Coffee scoring scheme is a secondary scoring scheme generated from dynamic programming with the primary scoring scheme, where gap penalties have already been taken into account.

2. Although the local alignment primary library doesn't reflect how a match between a pair of residues introduces gaps globally, if the match is also supported by the global alignment, the gap information is reflected through the global alignment. Otherwise the match will not have a high weight unless it is supported by consistency with the other sequences. In this case, a gap penalty is still not necessary.

In other words, the weight reflects how well the residue pair is supported by the direct local or global alignment it comes from, and by the indirect alignments that use all other sequences as intermediate sequences.

In practice, gap penalty = 0.

Progressive Alignment Strategy

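Given column n of each of the two groups being aligned (figure examples: a column of residues CCCT joined with TTT, and a column CCCTTTT joined with CC):

average_1(a1): average weight over the residue pairs within the first existing column
average_2(a2): average weight over the residue pairs within the second existing column
average_3(a3): average weight over the residue pairs between the two columns

We don't need to align pairs of residues within an existing column of an alignment; only the weights of matched pairs of residues between the existing columns are considered:

average_3(a3) = (sum of W[Seq_i(n), Seq_j(n)] over the residue pairs between the two columns) / (number of residues in column 1 * number of residues in column 2)

[Figure: average_1' = a1 + a2 + a3, average_2', average_3'; a1 and a2 are labelled "within", a3 is labelled "between".]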
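A minimal sketch of scoring one column of a group against one column of another group, using only the between-column residue pairs. The function names, the data layout and the weights are assumptions; in particular, averaging over all pairs with gaps counted as 0 is an illustrative choice, not a statement about the actual T-Coffee code.

```python
def lookup(lib, s1, p1, s2, p2):
    """Weight of residue p1 of s1 against residue p2 of s2, in either key order."""
    if (s1, s2) in lib:
        return lib[(s1, s2)].get((p1, p2), 0)
    return lib.get((s2, s1), {}).get((p2, p1), 0)


def between_column_score(lib, col_1, col_2):
    """Average library weight over residue pairs drawn from two columns.

    Each column is a list of (sequence_name, position) tuples, or None for a gap.
    Pairs inside one column are never scored.
    """
    pairs = [(r1, r2) for r1 in col_1 for r2 in col_2]
    if not pairs:
        return 0.0
    total = 0.0
    for r1, r2 in pairs:
        if r1 is None or r2 is None:
            continue  # a gap contributes no weight
        (s1, p1), (s2, p2) = r1, r2
        total += lookup(lib, s1, p1, s2, p2)
    return total / len(pairs)


# Hypothetical extended library and columns.
lib = {("A", "C"): {(3, 3): 165}, ("B", "C"): {(3, 2): 100}}
col_ab = [("A", 3), ("B", 3)]   # column from the already aligned group {A, B}
col_c = [("C", 3)]              # column from the group {C}
score = between_column_score(lib, col_ab, col_c)
print(score)
```

Here only the A-C pair is supported by the library, so the column pair scores 165 / 2 = 82.5; these scores would feed the dynamic programming step with no extra gap penalty.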



Validation & Result

Test cases are from BAliBASE. Why BAliBASE?

Reliability: the MSAs in BAliBASE result from manual structure comparison and are validated using the structure-superposition algorithms SSAP and DALI.

Comprehensiveness: the 141 MSA cases in BAliBASE can be grouped into 5 categories:
1. Group with phylogenetically equidistant members
2. Group with one orphan sequence and a group of close relatives
3. Group with two distant subgroups
4. Group in which some members have long terminal insertions
5. Group in which some members have long internal insertions

Thus the cases are unlikely to be biased toward any specific multiple-alignment method.

Validation method: Scoring Scheme and Multimethod Comparison

Scoring Scheme:
1. Column-wise comparison: a point is given only when the whole column is aligned correctly
2. SP (sum of pairs): a weighted point is given when the aligned column is partially correct

Validation is carried out by comparing each calculated multiple alignment with its counterpart in BAliBASE.
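The two scoring schemes can be sketched as follows: the column-wise score is taken as the fraction of reference columns reproduced exactly, and the sum-of-pairs score as the fraction of reference residue pairs that are also aligned in the test alignment. Function names and the toy alignments are hypothetical, and details such as restricting the comparison to core regions are left out.

```python
def columns(alignment):
    """Columns as tuples of 1-based residue positions (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                counters[s] += 1
                col.append(counters[s])
        cols.append(tuple(col))
    return cols


def column_wise_score(test, ref):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    test_cols = set(columns(test))
    ref_cols = columns(ref)
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


def aligned_pairs(alignment):
    """All residue pairs sharing a column: (seq_i, pos_i, seq_j, pos_j)."""
    out = set()
    for col in columns(alignment):
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                if col[i] is not None and col[j] is not None:
                    out.add((i, col[i], j, col[j]))
    return out


def sum_of_pairs_score(test, ref):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(ref)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy reference and test alignments of the same three sequences.
ref = ["ACGT", "A-GT", "ACG-"]
test = ["ACGT", "AG-T", "ACG-"]
cs = column_wise_score(test, ref)
sp = sum_of_pairs_score(test, ref)
print(cs, sp)
```

Misplacing one residue of the second sequence breaks two of the four columns (column-wise score 0.5) but loses only two of the eight residue pairs (sum-of-pairs score 0.75), which is why the column-wise test is the stricter of the two.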

Multimethod Comparison:

Candidate methods:
1. Prrp
2. ClustalW
3. MSA & DCA (eliminated at the very beginning)
4. Dialign2

Statistical method: Wilcoxon matched-pairs signed-rank test, a non-parametric test that ranks the absolute paired differences between two series of data and uses the difference between the rank sums of the positive and negative differences as its statistic.

H0: no difference
H1: there is a difference

If the P-value is large, we fail to reject H0; otherwise we reject H0.
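To make the rank sums concrete, the sketch below computes the positive and negative rank sums for paired scores in plain Python. The function name and the scores are hypothetical, zero differences are dropped, and turning the statistic into a P-value (from tables or a statistics library) is left out.

```python
def signed_rank_sums(xs, ys):
    """Return (W+, W-): rank sums of the positive and negative paired differences."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based; ties share the mean rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical per-test-case scores of two methods on the same five cases.
method_1 = [80, 75, 90, 60, 70]
method_2 = [70, 78, 85, 60, 50]
w_plus, w_minus = signed_rank_sums(method_1, method_2)
print(w_plus, w_minus)
```

A large gap between the two rank sums, as in this toy case, is what leads to a small P-value and a rejection of H0.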

Result: Extension Library is Superior to Primary Library

Comparison of three types of primary library:

1. ClustalW pairwise library (C) (extended to CE)
2. Lalign pairwise library (L) (extended to LE)
3. Pooling of the ClustalW and Lalign pairwise libraries (CL) (extended to CLE)

Result: CL > C , CL > L

Comparison of Extension library with Primary library

Result: CE > C , LE > L , CLE > CL

Comparison between three types of Extension Libraries

Result: CLE > CE , CLE > LE

So we can conclude that CLE is the best library to use as the scoring scheme.

Result: T-Coffee Method is Superior to Other Methods

In the comparison with other methods, the two scoring schemes were applied separately, and for each scoring scheme two kinds of test were applied.

Scoring scheme: column-wise
core region test: T-Coffee > Prrp > ClustalW
complete alignment test: T-Coffee > ClustalW > Prrp

Scoring scheme: sum of pairs
core region test: T-Coffee > ClustalW > Prrp
complete alignment test: ?

Result: T-Coffee does not always outperform other methods in all specific cases

Thank you!

Questions?
