gui-chen
Why Do We Need Multiple Sequence Alignment?
Homology Modeling
Phylogenetic reconstruction
Illustrate conserved and variable sites within a family
Can be used to construct a profile or HMM to search databases for distantly related members of the family
When constructing an MSA, we should in theory consider the evolutionary and structural relationships within the family. However…
1. Specific expert knowledge (when it is available at all) is hard to integrate into an algorithm
2. General empirical models of protein evolution do not work well when sequences are less than 30% identical
3. Mathematically sound methods are prohibitively demanding in computing resources
That is why we introduce heuristic methods.
A Brief Review of Previous Methods to Construct MSA
ClustalW: Heuristic method
Dynamic programming to build the guide tree, then progressive alignment.
Comment: fast, but suffers from its greediness (once a gap, always a gap); uses no local alignment information; takes little reference from the other sequences in the set while constructing the MSA.
MSA & DCA: Mathematically sound
Simultaneous alignment of all the sequences (Carrillo and Lipman algorithm, multidimensional dynamic programming).
Comment: extremely CPU- and memory-intensive, and no better than other methods in terms of alignment performance; uses no local alignment information.
Prrp & Muscle: Iterative method
Heuristic method combined with an iterative method.
Comment: an interesting approach, but neither significantly faster nor more accurate than ClustalW; uses no local alignment information; takes some reference from the other sequences in the set while constructing the MSA.
Question: theoretically speaking, why is the iterative method not better than ClustalW? And why should local alignment information be taken into account?
A Brief Review of Previous Methods to Construct MSA
Dialign2: Heuristic method
Simultaneous alignment of all sequences, but with a crude heuristic local alignment method.
Comment: fast, but considers only local alignment information; to some extent takes reference from all sequences in the set; aligns poorly in practice.
…so the motivation now is to build a method with all of the merits listed below:
Combine information from global alignment
Combine information from local alignment
Take reference from the other sequences in the set during alignment
T-Coffee Method
Heuristic method with practical computation time
Local alignment is more sensitive to domains and motifs
Reference should be taken from the other sequences when aligning conserved parts
An optimal alignment is not necessarily a biologically meaningful alignment
An Overview of the T-Coffee Method
…so how does T-Coffee deliver all the merits mentioned earlier?
Requirement: heuristic method with practical computation time
Solution: progressive alignment code from ClustalW
Requirement: combine information from global alignment
Solution: primary library from global alignment (code from ClustalW)
Requirement: combine information from local alignment
Solution: primary library from local alignment (code from Lalign in the FASTA package)
Requirement: take reference from the other sequences in the set during progressive alignment
Solution: extension library derived from the primary library (compare the intermediate sequence method)
Three major steps in FASTA:
1. Build a hash table
2. Concatenate matched k-tuples
3. Extend to get high-scoring segments
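The first two FASTA steps can be illustrated with a minimal Python sketch. It is not the FASTA or Lalign code: the function names, the choice k = 2, and the toy sequences are assumptions made for illustration, and the extension step (step 3) is left out.

```python
from collections import defaultdict

def ktuple_index(seq, k):
    """Step 1: hash table mapping each k-tuple to its start positions in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

def diagonal_hits(query, target, k=2):
    """Step 2 (simplified): count k-tuple matches per diagonal.

    A diagonal is target_pos - query_pos; matches on the same diagonal
    can be joined into one ungapped segment.
    """
    index = ktuple_index(target, k)
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits[tpos - qpos] += 1
    return dict(hits)

# Toy sequences (hypothetical, not from the slides).
hits = diagonal_hits("THEFATCAT", "THEFASTCAT", k=2)
best_diagonal = max(hits, key=hits.get)
```

Diagonals with many hits are the candidates that a real implementation would go on to extend and rescore.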
Primary Library of Alignments (Global and Local)
Library: a set of pairwise alignments between all of the sequences to be aligned, stored as a sequence-to-sequence, position-pair-specific weight list. A library can be stored as an N*N lower (or upper) triangular matrix in which the main diagonal is ignored and each entry is a weight list. In other words, it is a list of weighted pairwise constraints. The primary library of global alignments for a sequence set is denoted AG, and the primary library of local alignments is denoted AL. A* refers to either AG or AL.
Now suppose we have a sequence set of size N (N is the number of sequences in the set). The total number of sequence pairs in the set is N*(N-1)/2.
We use Ai to denote the ith sequence in the sequence set A, so that entry A*ij of matrix A* contains the information from the pairwise alignment of Ai with Aj. Before generating AG or AL, we first compute all possible pairwise alignments, using the global alignment method or the FASTA local segment matching method (Lalign).
*For local pairwise alignment we keep, by default, the ten top-scoring non-intersecting local alignments for each sequence pair. The number of segments derived from a pair may well be fewer than 10 (simply because there are not that many qualifying matched segments) and can be 0.
After the pairwise alignments, we derive a list of pairwise residue matches for each entry of A*. Let Xm denote the mth position in a given sequence Ai. The list in an entry can then be denoted (Xn, Xm)|A*ij.
Finally, we assign a weight to each pairwise residue match in every list of every entry of A*. The weight equals the percent identity of the alignment of Ai with Aj from which the residue match is derived: W[(Xn, Xm)|A*ij] = P.I.(A*ij). The weight is also referred to as a constraint.
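As a rough illustration of how one library entry could be filled in, the Python sketch below turns a single gapped pairwise alignment into a weight list. The function names and the toy alignment are hypothetical, and computing percent identity over columns where both sequences have a residue is an assumption; the slides do not say which denominator is used.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over columns where neither sequence has a gap (assumed definition)."""
    matched = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not matched:
        return 0.0
    identical = sum(1 for x, y in matched if x == y)
    return 100.0 * identical / len(matched)

def library_entry(aln_a, aln_b):
    """Return {(m, n): weight}: residue m of the first sequence aligned to residue n of the second (1-based)."""
    weight = percent_identity(aln_a, aln_b)
    entry = {}
    m = n = 0
    for x, y in zip(aln_a, aln_b):
        if x != "-":
            m += 1
        if y != "-":
            n += 1
        if x != "-" and y != "-":
            entry[(m, n)] = weight  # every match inherits the alignment's identity
    return entry

# Toy pairwise alignment (hypothetical data).
entry = library_entry("THEFA-TCAT", "THEFASTCAT")
```

Residue pairs that do not appear in the dictionary carry an implicit weight of 0, which is what makes the entry a sparse list.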
Primary Library of Alignments (Global and Local)
[Figure: the library as an N*N matrix with rows and columns A1 … AN; entry (i, j) holds a list of weights W(Xm, Xn|A*ij).]
A library is a generalized list that contains key-list and key-value pairs; each list in turn contains key-value pairs.
[Figure: example library entries produced for global alignment and for local alignment.]
Combination of the Libraries: Addition
Pooling the ClustalW and Lalign primary libraries is a simple process of addition:
AGL = AG + AL
W[(Xn, Xm)|AGLij] = W[(Xn, Xm)|AGij] + W[(Xn, Xm)|ALij]
If W[(Xn, Xm)|A*ij] is not recorded in A*, it is taken to be 0. Entry AGLij can then be regarded as a 'sparse' list of L(i)*L(j) key-value pairs, many of whose values are 0. L(i) denotes the length (number of residues) of sequence Ai.
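The addition can be sketched in a few lines of Python, assuming (purely for illustration) that a library is a dictionary keyed by sequence pair (i, j) whose values are dictionaries from residue pairs to weights. The weights below are made up.

```python
def add_libraries(lib_g, lib_l):
    """Pool two libraries; weights missing from one library count as 0."""
    combined = {}
    for key in set(lib_g) | set(lib_l):
        entry = dict(lib_g.get(key, {}))
        for pair, weight in lib_l.get(key, {}).items():
            entry[pair] = entry.get(pair, 0.0) + weight
        combined[key] = entry
    return combined

# Hypothetical weights for one sequence pair (0, 1).
AG = {(0, 1): {(1, 1): 88.0, (2, 2): 88.0}}
AL = {(0, 1): {(2, 2): 100.0, (3, 4): 100.0}}
AGL = add_libraries(AG, AL)
```

A residue pair supported by both the global and the local alignment ends up with the sum of the two weights; a pair supported by only one keeps its single weight.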
Library can be used as scoring scheme
A library A* can be regarded as a sequence-to-sequence, position-pair-specific scoring scheme.
It is a secondary scoring scheme, derived from dynamic programming pairwise alignments that use a substitution matrix as the primary scoring scheme.
Extending the library: Background
Purpose: to take reference from the other sequences at each step of the progressive alignment.
Previous solutions: fitting a set of weighted constraints into a multiple alignment is a well-known problem, formulated by Kececioglu as an instance of the "maximum weight trace" problem, which is NP-complete. Two optimization strategies have been proposed:
1. Genetic algorithm: prohibitive computation time
2. Graph-theoretical method: not robust enough for all cases
In short, the problem has not been solved satisfactorily from a graph-theoretic point of view.
Solution proposed by this paper: a heuristic algorithm inspired by the intermediate sequence method, the triplet approach.
Extending the library: Triplet approach
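[Figure: triplet extension of the weight W(A(G), B(G)). Each remaining sequence (seq C, seq D) is considered in turn as an intermediate, using W(A(G), C(?)) and W(A(G), D(?)).]
If A(G) and B(G) are aligned with the same residue of the intermediate sequence, that sequence contributes the minimum of the two weights involved (in the figure, W(A(G), C(?)) = 77 and W(A(G), D(?)) = 100); otherwise it contributes 0.
For W(A(G), B(G)): E[W(A(G), B(G))] = W(A(G), B(G)) + contribution = 88 + 77
Sometimes we get a better alignment if we do not strictly follow the guide tree. That is why we draw on the other sequences when aligning two sequences along the guide tree. Iterative methods (e.g. MUSCLE) pursue the same goal by modifying the guide tree heuristically.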
Extending the library: Let’s code this process
Denote the library extension operator by AE. Note that AE is not a library that can be added to A*; it is a function of A*: AE(A*) = A*E.
def AE(A*):
    for i = 1 .. N:
        for j = i+1 .. N:                      // go through every entry A*ij: C(2,N)
            for m = 1 .. L(Ai):
                for n = 1 .. L(Aj):            // go through all constraints in the entry: L^2
                    E = 0
                    for each Ak in A - {Ai, Aj}:
                        a = get_position(m, i, k)
                        b = get_position(n, j, k)
                        for each position p in both a and b:   // residues of Ak supporting the match of Xm|Ai and Xn|Aj: 2L
                            e1 = W[(Xm, Xp)|A*ik]
                            e2 = W[(Xn, Xp)|A*jk]
                            E += min{e1, e2}   // extension weight
                    W[(Xm, Xn)|A*Eij] = W[(Xm, Xn)|A*ij] + E   // write to A*E, leave A* unchanged
def get_position(m, i, k):
    a = empty list
    for n = 1 .. L(Ak):                        // find the positions in Ak aligned with Xm|Ai: L(Ak)
        if W[(Xm, Xn)|A*ik] != 0: add n to a
    return a
Counting the loop over Ak as well: C(2,N) * L^2 * (N-2) * 2L = O(N^3 * L^3)
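A runnable Python sketch of the triplet extension follows. It is not the T-Coffee source code: the flat dictionary layout (keys are pairs of (sequence index, position) tuples, smaller tuple first) and the function name are assumptions. Instead of scanning all L^2 position pairs, it walks the neighbours of each intermediate residue, which gives the same result on sparse libraries; it also reads from the original library and writes into a copy, so extended weights never feed back into the extension.

```python
from collections import defaultdict

def extend_library(lib):
    """Return the extended library: direct weight plus, for every intermediate
    residue z aligned with both x and y, min(W(x, z), W(z, y))."""
    adj = defaultdict(dict)
    for (x, y), w in lib.items():
        adj[x][y] = w
        adj[y][x] = w
    extended = dict(lib)
    for z, neighbours in adj.items():
        items = list(neighbours.items())
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (x, wx), (y, wy) = items[a], items[b]
                if x[0] == y[0]:
                    continue  # two residues of the same sequence are never aligned
                key = (x, y) if x < y else (y, x)
                extended[key] = extended.get(key, 0.0) + min(wx, wy)
    return extended

# Residue 1 of sequences 0, 1, 2 (toy weights echoing the 88 + 77 example).
primary = {
    ((0, 1), (1, 1)): 88.0,
    ((0, 1), (2, 1)): 77.0,
    ((1, 1), (2, 1)): 100.0,
}
extended = extend_library(primary)
```

In the toy data the direct weight 88 between sequences 0 and 1 gains min(77, 100) = 77 through sequence 2, giving 165.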
Extending the library: Let’s formulate this process
AGLE = AE(AGL) = AE(AG + AL)
Notice that the operator AE does not distribute over addition. That is to say: AGLE ≠ AE(AG) + AE(AL)
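A small numeric check shows why extension and addition do not commute. The weights are invented for illustration: the global library supports only the A-C match and the local library supports only the B-C match, so neither library alone lets C act as an intermediate between A and B.

```python
def triplet_contribution(w_xz, w_zy):
    """Contribution of one intermediate residue z to the pair (x, y)."""
    return min(w_xz, w_zy)

# Hypothetical weights: "AC" = W(A, C), "BC" = W(B, C).
global_lib = {"AC": 50.0, "BC": 0.0}
local_lib = {"AC": 0.0, "BC": 50.0}

# Extend each library separately, then add the contributions.
separate = (triplet_contribution(global_lib["AC"], global_lib["BC"])
            + triplet_contribution(local_lib["AC"], local_lib["BC"]))

# Add the libraries first, then extend.
pooled = triplet_contribution(global_lib["AC"] + local_lib["AC"],
                              global_lib["BC"] + local_lib["BC"])
```

Here `separate` is 0 while `pooled` is 50: only the pooled library lets evidence from the two alignment types reinforce each other.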
Conclusion: Coffee Score Scheme
Given any pair of residues from any two sequences in sequence set:
If weight = 0, the residue pair is never supported by global, local, or triplet-extension alignment (in other words, the residues of the pair may end up aligned with gaps).
If weight > 0, the weight reflects a combination of the similarity of the pair of sequences (global) or sequence segments (local) that the residue pair comes from and the consistency of the match with residues from the other sequences in the set.
The weight library can then be used as the Coffee scoring scheme for progressive alignment.
*When applying the Coffee scoring scheme in dynamic programming or progressive alignment, there is no need to set additional gap-opening or gap-extension penalties, for two reasons:
1. The Coffee scoring scheme is a secondary scoring scheme generated by dynamic programming with a primary scoring scheme, in which gap penalties have already been taken into account.
2. The local alignment primary library does not reflect how a residue match introduces gaps globally. However, if the match is also supported by the global alignment, the gap information is carried by the global alignment. Otherwise, the match will not receive a high weight unless it is supported by consistency with the other sequences. In either case, a gap penalty is still unnecessary.
In other words, the weight reflects how strongly the residue pair is supported by the direct local or global alignment it comes from and by indirect alignment through all other sequences acting as intermediate sequences.
Practically, gap penalty = 0.
Progressive Alignment Strategy
[Figure: progressive alignment of columns, given column n: CCCT + TTT, then CCCTTTT + CC. Recoverable labels: average_1(a1), average_2(a2), average_3(a3); average_1' = a1 + a2 + a3, average_2', average_3'; terms marked Within, Within, Between.]
There is no need to score pairs of residues within an existing alignment column; only the weights of matched residue pairs between the existing columns are considered.
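One way to read the between-column scoring is sketched below in Python. The formula on the slide did not survive extraction, so this is an assumption rather than the slide's exact definition: the score for aligning two existing columns is taken as the plain average of the library weights over all residue pairs with one residue in each column. The library layout and the weights are hypothetical.

```python
def column_pair_score(col_a, col_b, lib):
    """Average library weight over residue pairs between two columns.

    col_a and col_b list the residues present in each column as
    (sequence index, position) tuples; gaps are simply left out.
    """
    total, count = 0.0, 0
    for x in col_a:
        for y in col_b:
            key = (x, y) if x < y else (y, x)
            total += lib.get(key, 0.0)  # unsupported pairs weigh 0
            count += 1
    return total / count if count else 0.0

# Hypothetical extended weights for residue 1 of sequences 0, 1, 2.
lib = {((0, 1), (2, 1)): 165.0, ((1, 1), (2, 1)): 177.0}
score = column_pair_score([(0, 1), (1, 1)], [(2, 1)], lib)
```

Pairs inside `col_a` or inside `col_b` are never looked up, which matches the remark that within-column pairs do not need to be scored again.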
Test Cases Are from BAliBASE: Why BAliBASE?
Reliability: the MSAs in BAliBASE result from manual structure comparison and are validated with the structure-superposition algorithms SSAP and DALI.
Comprehensiveness: the 141 MSA cases in BAliBASE fall into 5 categories:
1. Group with phylogenetically equidistant members
2. Group with one orphan sequence and a group of close relatives
3. Group with two distant subgroups
4. Group in which some members have long terminal insertions
5. Group in which some members have long internal insertions
Thus the cases are unlikely to be biased toward any specific multiple-alignment method.
Validation method: Scoring Scheme and Multimethod Comparison
Scoring scheme:
1. Column-wise comparison: a point is scored only when the whole column is aligned correctly
2. SP (sum of pairs): a weighted point is scored when a column is partially correct
Validation is carried out by comparing each calculated multiple alignment with its counterpart in BaliBase.
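Both scores can be sketched in Python for alignments given as lists of gapped strings. The function names and the toy alignments are hypothetical, and the sum-of-pairs score here is the unweighted fraction of reference residue pairs that are also aligned in the test alignment; the weighting mentioned above is not reproduced.

```python
def residue_columns(alignment):
    """Turn gapped rows into columns of 1-based residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for s, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                counters[s] += 1
                entry.append(counters[s])
        columns.append(tuple(entry))
    return columns

def aligned_pairs(columns):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    pairs = set()
    for col in columns:
        present = [(s, p) for s, p in enumerate(col) if p is not None]
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs

def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_cols = residue_columns(reference)
    test_cols = set(residue_columns(test))
    return sum(1 for c in ref_cols if c in test_cols) / len(ref_cols)

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test alignment."""
    ref_pairs = aligned_pairs(residue_columns(reference))
    test_pairs = aligned_pairs(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)

# Toy alignments (hypothetical): the test misplaces one gap in the second row.
reference = ["ACGT", "A-GT", "ACGT"]
test = ["ACGT", "-AGT", "ACGT"]
cs = column_score(test, reference)
sp = sum_of_pairs_score(test, reference)
```

The misplaced gap breaks two of the four reference columns, so the column score drops to 0.5, while 8 of the 10 reference residue pairs survive and the sum-of-pairs score stays at 0.8. This is why the column score is the stricter of the two.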
Multimethod comparison, candidate methods:
1. Prrp
2. ClustalW
3. MSA & DCA (eliminated at the very beginning)
4. Dialign2
Statistical method: Wilcoxon matched-pairs signed-rank test, a non-parametric test whose statistic is based on the sums of ranks of the positive and negative differences between two paired series of data.
H0: no difference. H1: there is a difference.
If the P-value is large, H0 is not rejected; otherwise H0 is rejected.
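The rank sums behind the test can be computed with a short pure-Python sketch. The function name and the paired scores are made up for illustration and are not results from the comparison; zero differences are dropped and ties receive average ranks, which is the usual convention. Turning the rank sums into a P-value (from tables or a statistics package) is not shown.

```python
def wilcoxon_signed_rank(xs, ys):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(ordered) and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Made-up accuracy scores of two methods on five test cases.
method_a = [80, 75, 90, 60, 85]
method_b = [70, 78, 85, 60, 80]
w_plus, w_minus = wilcoxon_signed_rank(method_a, method_b)
```

A large gap between the two rank sums (here 9 against 1) points toward a consistent difference between the methods, although with only four non-zero differences no significance can be claimed.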
Result: Extension Library is Superior to Primary Library
Comparison of three types of primary library:
1. ClustalW pairwise library (C) (extended to CE)
2. Lalign pairwise library (L) (extended to LE)
3. Pooling of the ClustalW and Lalign pairwise libraries (CL) (extended to CLE)
Result: CL > C , CL > L
Comparison of Extension library with Primary library
Result: CE > C , LE > L , CLE > CL
Comparison between three types of Extension Libraries
Result: CLE > CE , CLE > LE
We can therefore conclude that CLE is the best library to use as a scoring scheme.
Result: T-Coffee Method is Superior to Other Methods
For the comparison with other methods, the two scoring schemes were applied separately, and two kinds of test were run for each scoring scheme.
Column-wise:
core region test: T-Coffee > Prrp > ClustalW
complete alignment test: T-Coffee > ClustalW > Prrp
Sum of pairs:
core region test: T-Coffee > ClustalW > Prrp
complete alignment test: ?