Sequence Comparison - Alignment
Alignments can be thought of as two sequences differing due to mutations that happened during evolution.
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Scoring Alignments
Alignments are based on three basic operations:
1. Substitutions
2. Insertions
3. Deletions
A score is assigned to each single operation (resulting in a scoring matrix and also in gap penalties). Alignments are then scored by adding the scores of their operations.
Standard formulations of string alignment optimize this alignment score.
An Example of Scoring an Alignment Using a Scoring Matrix

      A    R    N    K
A     5   -2   -1   -1
R     -    7   -1    3
N     -    -    7    0
K     -    -    -    6
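A minimal sketch of scoring an alignment column by column with the mini-matrix above; the linear gap penalty of -4 is an illustrative assumption, not a value from the slides:

```python
# Symmetric mini scoring matrix over A, R, N, K from the table above;
# only one triangle is stored, lookups fall back to the swapped pair.
SCORES = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}
GAP = -4  # illustrative linear gap penalty (assumption, not from the slides)

def score(x: str, y: str) -> int:
    """Score a gapped alignment by adding the scores of its operations."""
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += SCORES.get((a, b), SCORES.get((b, a)))
    return total

print(score("ARNK", "AR-K"))  # 5 + 7 - 4 + 6 = 14
```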
Scoring Matrices in Practice
Some choices for substitution scores are now common, largely due to convention.
Most commonly used amino-acid substitution matrices:
PAM (Percent Accepted Mutation)
BLOSUM (Blocks Amino Acid Substitution Matrix)
BLOSUM50 Scoring Matrix
Gap Penalties
Inclusion of gaps and gap penalties is necessary to obtain the best alignment.
If the gap penalty is too high, gaps will never appear in the alignment:
AATGCTGC
ATGCTGCA
If the gap penalty is too low, gaps will appear everywhere in the alignment:
AATGCTGC----
A----TGCTGCA
Gap Penalties (Cont’d)
Separate penalties for gap opening and gap extension:
Opening: the cost to introduce a gap
Extension: the cost to elongate a gap
Opening a gap is costly, while extending a gap is cheap.
Unlike scoring matrices, no gap penalties are commonly agreed upon.
LETVGYW----L
       -5 -1 -1 -1
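The affine scheme above (open -5, extend -1) can be sketched as:

```python
def affine_gap_score(aln: str, gap_open: int = -5, gap_extend: int = -1) -> int:
    """Sum affine gap penalties over one row of an alignment:
    the first '-' of a run costs gap_open, each further '-' costs gap_extend."""
    total, in_gap = 0, False
    for c in aln:
        if c == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return total

print(affine_gap_score("LETVGYW----L"))  # -5 + 3 * -1 = -8
```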
Parametric Sequence Alignment
For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (exhaustive search).
A correct alignment is then used to find the best parameter values.
However, this method is very inefficient if the number of parameters is large.
Inverse Parametric Alignment
INPUT: an alignment of a pair of strings.
OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings.
From a machine-learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.
Inverse Optimal Alignment
Definition (Inverse Optimal Alignment):
INPUT: alignments A1, A2, ..., Ak of strings,
       an alignment scoring function f_w with parameters w = (w1, w2, ..., wp).
OUTPUT: values x = (x1, x2, ..., xp) for w.
GOAL: each input alignment is an optimal alignment of its strings under f_x.
ATTENTION: this problem may have no solution!
Inverse Near-Optimal Alignment
When minimizing the scoring function f, we say an alignment A of a set of strings S is ε-optimal, for some ε ≥ 0, if:
f(A) ≤ (1 + ε) f(A*)
where A* is the optimal alignment of S under f.
Inverse Near-Optimal Alignment (Cont’d)
Definition (Inverse Near-Optimal Alignment):
INPUT: alignments Ai,
       a scoring function f,
       a real number ε ≥ 0.
OUTPUT: parameter values x.
GOAL: each alignment Ai is ε-optimal under f_x.
The smallest possible ε can be found within accuracy δ using O(log(ε/δ)) calls to the algorithm.
Inverse Unique-Optimal Alignment
When minimizing the scoring function f, we say an alignment A of a set of strings S is δ-unique, for some δ ≥ 0, if:
f(B) ≥ f(A) + δ
for every alignment B of S other than A.
Inverse Unique-Optimal Alignment (Cont’d)
Definition (Inverse Unique-Optimal Alignment):
INPUT: alignments Ai,
       a scoring function f,
       a real number δ ≥ 0.
OUTPUT: parameter values x.
GOAL: each alignment Ai is δ-unique under f_x.
The largest possible δ can be found within accuracy ε using O(log(δ/ε)) calls to the algorithm.
Let There Be Linear Functions …
For most standard forms of alignment, the alignment scoring function f is a linear function of its parameters:
f(A) := f0(A) + f1(A) w1 + ... + fp(A) wp
where each fi measures one of the features of A.
Let There Be Linear Functions … (Example I)
With fixed substitution scores and two parameters, the gap-open (γ) and gap-extension (λ) penalties, p = 2 and:
f(A) = s(A) + γ g(A) + λ l(A)
where:
g(A) = number of gaps
l(A) = total length of gaps
s(A) = total score of all substitutions
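A sketch of extracting the three features g(A), l(A), s(A) from an alignment and evaluating the linear score f(A); the substitution scorer and the parameter values γ, λ are illustrative assumptions:

```python
def features(x: str, y: str, subs):
    """Feature values of an aligned row pair: s(A) = total substitution
    score, g(A) = number of gaps, l(A) = total length of gaps."""
    s = g = l = 0
    in_gap = False
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            l += 1
            if not in_gap:
                g += 1       # a new gap is opened
            in_gap = True
        else:
            s += subs(a, b)  # an aligned pair contributes a substitution score
            in_gap = False
    return s, g, l

def f(x, y, subs, gamma=-5.0, lam=-1.0):
    """f(A) = s(A) + gamma * g(A) + lam * l(A); parameter values are toy."""
    s, g, l = features(x, y, subs)
    return s + gamma * g + lam * l

match = lambda a, b: 1 if a == b else -1  # toy substitution scores (assumption)
print(f("AGG-CTA", "AGGGCTA", match))     # 6 - 5 - 1 = 0.0
```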
Let There Be Linear Functions … (Example II)
With no parameters fixed, the substitution scores σab are also among our parameters and:
f(A) = Σa,b σab hab(A) + γ g(A) + λ l(A)
where:
a and b range over all letters in the alphabet
hab(A) = # of substitutions in A replacing a by b
Linear Programming Problem
INPUT: variables x = (x1, x2, ..., xn),
       a system of linear inequalities in x,
       a linear objective function in x.
OUTPUT: an assignment of real values x* to x.
GOAL: satisfy all the inequalities and minimize the objective:
x* := argmin x≥0 { cx : Ax ≥ b }
In general, the program can be infeasible, bounded, or unbounded.
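A toy illustration of such a program: this hypothetical two-variable solver enumerates candidate vertices of the polyhedron rather than using a real LP library, so it is a sketch only:

```python
from itertools import combinations

def solve_lp_2d(c, A, b):
    """Tiny 2-variable LP sketch: minimize c.x subject to A x <= b.
    Enumerates intersections of constraint boundaries (candidate vertices)
    and keeps the feasible one with smallest objective value.
    Returns None when infeasible; cannot detect unboundedness."""
    best = None
    for (a1, b1), (a2, b2) in combinations(zip(A, b), 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue  # parallel boundaries, no intersection point
        x = ((b1 * a2[1] - b2 * a1[1]) / det,
             (a1[0] * b2 - a2[0] * b1) / det)
        if all(ai[0] * x[0] + ai[1] * x[1] <= bi + 1e-9 for ai, bi in zip(A, b)):
            val = c[0] * x[0] + c[1] * x[1]
            if best is None or val < best[0]:
                best = (val, x)
    return best

# minimize x1 + x2 subject to x1 >= 1, x2 >= 2 (rewritten as -x1 <= -1, -x2 <= -2)
print(solve_lp_2d([1, 1], [[-1, 0], [0, -1]], [-1, -2]))  # (3.0, (1.0, 2.0))
```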
Reducing the Inverse Alignment Problems to Linear Programming
Inverse Optimal Alignment: for each Ai and every alignment B of the set Si, we have an inequality:
f_x(B) ≥ f_x(Ai)
or equivalently:
Σj=1..p (fj(B) − fj(Ai)) xj + (f0(B) − f0(Ai)) ≥ 0
The number of alignments of a pair of strings of length n is Θ((3 + 2√2)^n / n^(1/2)), hence a total of Ω(k 4^n) inequalities in p variables. Also, there is no specific objective function.
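Each competitor alignment B contributes one such inequality; a sketch of building its coefficient row from feature vectors (the feature values below are hypothetical):

```python
def lp_row(f_A, f_B):
    """One LP constraint per competitor alignment B of a training alignment A:
    f_x(B) >= f_x(A)  <=>  sum_j (f_j(B) - f_j(A)) x_j >= f_0(A) - f_0(B).
    f_A and f_B are feature vectors (f_0, f_1, ..., f_p).
    Returns (coefficients of x_1..x_p, right-hand side)."""
    coeffs = [fb - fa for fa, fb in zip(f_A[1:], f_B[1:])]
    rhs = f_A[0] - f_B[0]
    return coeffs, rhs

# Toy feature vectors: A = (0, 6, 1, 1), B = (0, 4, 0, 0)
# gives the constraint  -2*x1 - x2 - x3 >= 0
print(lp_row((0, 6, 1, 1), (0, 4, 0, 0)))  # ([-2, -1, -1], 0)
```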
Separation Theorem
Some definitions:
1. Polyhedron: intersection of half-spaces
2. Rational polyhedron: described by inequalities with only rational coefficients
3. Bounded polyhedron: no infinite rays
Separation Theorem (Cont’d)
Optimization Problem for a rational polyhedron P in R^d:
INPUT: rational coefficients c specifying the objective.
OUTPUT: a point x in P minimizing cx, or a determination that P is empty.
Separation Problem for P:
INPUT: a point y in R^d.
OUTPUT: rational coefficients w and b such that wx ≤ b for all points x in P but wy > b (a violated inequality), or a determination that y is in P.
Separation Theorem (Cont’d)
Theorem (Equivalence of Separation and Optimization): the optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time.
That is, for bounded rational polyhedra:
Optimization ⇔ Separation
Cutting-Plane Algorithm
1. Start with a small subset S of the set L of all inequalities.
2. Compute an optimal solution x under the constraints in S.
3. Call the separation algorithm for L on x.
4. If x is determined to satisfy L, output it and halt; otherwise, add the violated inequality to S and loop back to step (2).
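The four steps can be sketched as a generic loop; solve_lp and separate are assumed callables supplied by the caller, and the usage below replaces real LP machinery with a one-dimensional toy:

```python
def cutting_plane(solve_lp, separate, S):
    """Cutting-plane loop as a sketch.
    solve_lp(S) -> optimal solution under the current constraint subset S
    separate(x) -> a violated inequality from the full set L, or None."""
    while True:
        x = solve_lp(S)          # step 2: optimize over the subset S
        cut = separate(x)        # step 3: ask the separation oracle
        if cut is None:          # step 4: x satisfies all of L
            return x
        S.append(cut)            # add the violated inequality, loop back

# Toy instance: minimize x subject to x >= c for each c in L_full.
L_full = [1.0, 3.0, 2.0, 7.0, 5.0]
solve = lambda S: max(S)                                   # optimum under S
sep = lambda x: next((c for c in L_full if x < c - 1e-9), None)
print(cutting_plane(solve, sep, [1.0]))  # 7.0
```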
Complexity of Inverse Alignment
Theorem: Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which:
1. the alignment scoring function is linear,
2. the parameter values can be bounded,
3. for any fixed parameter choice, an optimal alignment can be found in polynomial time.
Inverse Unique-Optimal Alignment can be solved in polynomial time if, in addition:
3'. for any fixed parameter choice, a next-best alignment can be found in polynomial time.
Application to Global Alignment
Initializing the Cutting-Plane Algorithm: we consider the problem in two cases:
1. All scores and penalties varying: the parameter space can be made bounded.
2. Substitution costs fixed: either (1) a bounding inequality, or (2) two inequalities, one a downward half-space and the other an upward half-space, where the slope of the former is less than the slope of the latter, can be found in O(1) time, if they exist.
Application to Global Alignment (Cont’d)
Choosing an Objective Function: again we consider two cases:
1. Fixed substitution scores: we choose the objective
max { γ + λ }
2. Varying substitution scores: we choose the objective
max { i − s }
where s is the minimum of all non-identity substitution scores and i is the maximum of all identity scores.
Application to Global Alignment (Cont’d)
For every objective, two extreme solutions exist: x_large and x_small. Then for every 0 ≤ α ≤ 1 we have a corresponding solution:
x_α := α x_large + (1 − α) x_small
x_1/2 is expected to generalize better to alignments outside the training set.
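The blend is a simple convex combination; a sketch with hypothetical extreme solutions:

```python
def x_alpha(x_large, x_small, alpha=0.5):
    """Componentwise blend of the two extreme parameter solutions;
    alpha = 1/2 gives the x_1/2 solution from the slide."""
    return [alpha * xl + (1 - alpha) * xs for xl, xs in zip(x_large, x_small)]

# Hypothetical extreme parameter vectors (gap-open, gap-extend):
print(x_alpha([4.0, 2.0], [2.0, 0.0]))  # [3.0, 1.0]
```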
CONTRAlign
What: an extensible and fully automatic parameter-learning framework for protein pairwise sequence alignment
How: pair conditional random fields (pair-CRFs)
Who:
Pair-HMMs for Sequence Alignment
P(a, x, y) = π_M · e_M(F,Y) · t_MM · e_M(G,G) · ... · t_MI · e_I(A) · t_IM · e_M(G,G) · ...
That is, the joint probability of an alignment a and the sequences x, y is the product of the start, emission, and transition probabilities along the alignment's state path.
Pair-HMMs … (Cont’d)
If w = [log π_M, log t_MM, log e_M(G,G), ...]^T then:
P(a, x, y; w) = exp(w^T f(a, x, y))
where:
f(a, x, y) = ( # of times the alignment starts in state M,
               # of times the alignment follows transition MM,
               # of times the alignment generates (G,G) in state M,
               ... )
For the example alignment above, these counts are (1, 2, 1, ...).
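A sketch of the log-linear identity: with w holding log-probabilities and f holding the counts (1, 2, 1), exp(w·f) recovers the product of the corresponding probabilities. The numeric probability values are illustrative assumptions:

```python
import math

# Hypothetical parameters: start prob pi_M, transition t_MM, emission e_M(G,G)
w = [math.log(0.9), math.log(0.8), math.log(0.05)]
# Feature counts: (starts in M, M->M transitions, (G,G) emissions in state M)
f = [1, 2, 1]

# P(a, x, y; w) = exp(w . f) equals the product 0.9 * 0.8**2 * 0.05
p = math.exp(sum(wi * fi for wi, fi in zip(w, f)))
print(round(p, 4))  # 0.0288
```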
Training Pair-HMMs
INPUT: a set of training examples D = {(a(i), x(i), y(i))}, i = 1..m
OUTPUT: the feature vector w
METHOD: maximizing the joint log-likelihood of the data and alignments under constraints on w:
l(w : D) := Σi=1..m log P(a(i), x(i), y(i); w)
Generating Alignments Using Pair-HMMs
Viterbi Algorithm on a pair-HMM:
INPUT: two sequences x and y
OUTPUT: the alignment a of x and y that maximizes P(a | x, y; w)
RUNNING TIME: O(|x| · |y|)
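A sketch of the O(|x|·|y|) dynamic program: this simplified version uses a single score table with fixed match/mismatch/gap scores instead of the full three-state pair-HMM, so it illustrates the recurrence and running time rather than the exact model:

```python
def viterbi_align(x: str, y: str, match=1.0, mismatch=-1.0, gap=-2.0):
    """Viterbi-style DP for the best-scoring global alignment of x and y.
    Single-table sketch in log-space scores; O(|x| * |y|) time and space."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap           # x aligned against leading gaps
    for j in range(1, m + 1):
        D[0][j] = j * gap           # y aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + s,   # emit a pair (state M)
                          D[i - 1][j] + gap,     # gap in y (state I_x)
                          D[i][j - 1] + gap)     # gap in x (state I_y)
    return D[n][m]

print(viterbi_align("AGC", "AAC"))  # two matches + one mismatch = 1.0
```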
Pair-CRFs
Directly model the conditional probabilities:
P(a | x, y; w) = P(a, x, y; w) / Σa' P(a', x, y; w) = exp(w^T f(a, x, y)) / Σa' exp(w^T f(a', x, y))
where w is a real-valued parameter vector not necessarily corresponding to log-probabilities.
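A toy sketch of the conditional probability: with the full set of competing alignments replaced by a hypothetical two-element candidate list, the softmax normalization is:

```python
import math

def crf_prob(a, candidates, feats, w):
    """P(a | x, y; w) = exp(w.f(a,x,y)) / sum_a' exp(w.f(a',x,y)).
    'candidates' stands in for the set of all alignments a' (toy list here);
    'feats' maps an alignment to its feature vector."""
    score = lambda al: math.exp(sum(wi * fi for wi, fi in zip(w, feats[al])))
    return score(a) / sum(score(al) for al in candidates)

feats = {"a1": [2.0, 0.0], "a2": [1.0, 1.0]}  # hypothetical feature vectors
w = [1.0, -1.0]                               # need not be log-probabilities
print(round(crf_prob("a1", ["a1", "a2"], feats, w), 3))  # 0.881
```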
Training Pair-CRFs
INPUT: a set of training examples D = {(a(i), x(i), y(i))}, i = 1..m
OUTPUT: a real-valued feature vector w
METHOD: maximizing the conditional log-likelihood of the data (discriminative/conditional learning):
l(w : D) := Σi=1..m log P(a(i) | x(i), y(i); w) + log P(w)
where P(w) ∝ exp(−Σj Cj wj²) is a Gaussian prior on w, to prevent over-fitting.
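A sketch of the regularized objective; the placeholder conditional model and the single regularization constant C are illustrative assumptions:

```python
import math

def objective(D, crf_prob, w, C=1.0):
    """l(w : D) = sum_i log P(a_i | x_i, y_i; w) + log P(w), where the
    Gaussian prior contributes log P(w) = -C * sum_j w_j**2 (up to a constant).
    crf_prob(a, x, y, w) is assumed to return a conditional probability."""
    loglik = sum(math.log(crf_prob(a, x, y, w)) for a, x, y in D)
    return loglik - C * sum(wj * wj for wj in w)

toy = lambda a, x, y, w: 0.5          # placeholder conditional model (assumption)
D = [("a", "x", "y")] * 2
print(round(objective(D, toy, [1.0, 2.0]), 3))  # 2*log(0.5) - 5 = -6.386
```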
Properties of Pair-CRFs
Far weaker independence assumptions than pair-HMMs
Capable of utilizing complex, non-independent feature sets
Directly optimizes predictive ability, ignoring P(x, y), the model that generates the input sequences
Choice of Model Topology in CONTRAlign
Some possible model topologies:
CONTRAlign_Double-Affine
CONTRAlign_Local
Choice of Feature Sets in CONTRAlign
Some possible feature sets to utilize:
1. Hydropathy-based gap context features (CONTRAlign_HYDROPATHY)
2. External information:
2.1. Secondary structure (CONTRAlign_DSSP)
2.2. Solvent accessibility (CONTRAlign_ACCESSIBILITY)
Results: Comparison of Model Topologies and Feature Sets
Results: Comparison to Modern Sequence Alignment Tools
Results: Alignment Accuracy in the "Twilight Zone"
For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali), in that order, and the colored bar indicates the accuracy for CONTRAlign.