Aligning Alignments Exactly
By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona
Appeared in the 8th ACM RECOMB, 2004
Presented by Jie Meng
Outline: Background; Definition; Hardness; An exponential-time algorithm
Alignments
Given two (DNA or protein) sequences, an alignment puts them against each other such that the similar parts are aligned as closely as possible, for example:
A T - C - T C G C T
- T G - A T G - A T
There are four kinds of aligned columns:
Match;
Insertion;
Deletion;
Mismatch.
Scoring Alignments
There are four types of aligned columns:
– Match: score_match = 0.
– Mismatch: score_mismatch ≥ 0.
– Insertion: score_insertion ≥ 0.
– Deletion: score_deletion ≥ 0.
The score of an alignment is defined to be the sum of the scores of its aligned columns.
The goal is to minimize the score.
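The column-by-column scoring above can be sketched in a few lines of Python. This is only an illustration: the unit costs in `COSTS`, the use of '-' for a space, and the convention that a space in the first row is an "insertion" are assumptions, not values from the slides.

```python
# Illustrative unit costs (assumed); only "match = 0" comes from the slide.
COSTS = {"match": 0, "mismatch": 1, "insertion": 1, "deletion": 1}


def column_type(a, b):
    """Classify one aligned column; '-' denotes a space."""
    if a == "-" and b == "-":
        raise ValueError("a column may not be all spaces")
    if a == "-":
        return "insertion"  # assumed convention: space in the first row
    if b == "-":
        return "deletion"   # assumed convention: space in the second row
    return "match" if a == b else "mismatch"


def alignment_score(row1, row2, costs=COSTS):
    """Sum the column scores of a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("rows of an alignment must have equal length")
    return sum(costs[column_type(a, b)] for a, b in zip(row1, row2))


# The example alignment from the "Alignments" slide.
print(alignment_score("AT-C-TCGCT", "-TG-ATG-AT"))  # 7 with the unit costs above
```

With these assumed costs the example has 3 deletions, 2 insertions and 2 mismatches, hence a score of 7.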
Gap-cost
We can refine the indel score into a gap-open cost (open) and a gap-extension cost (extension): a gap of size x then costs open + x * extension instead of x * indel.
AT----CGCTTCAT
-TGCAT--AT-----
open + 4 * extension
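As a small sketch of the gap-cost formula, the following Python computes open + x * extension for every run of spaces in a row. The helper names and the numeric costs (open = 2, extension = 1) are illustrative assumptions.

```python
import re


def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of a pairwise alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", row)]


def affine_gap_cost(x, open_cost, extension_cost):
    """Cost of a single gap of size x: open + x * extension."""
    return open_cost + x * extension_cost


def row_gap_cost(row, open_cost, extension_cost):
    """Total gap cost charged for the gaps appearing in one row."""
    return sum(affine_gap_cost(x, open_cost, extension_cost)
               for x in gap_lengths(row))


# The single gap of size 4 in the first row of the slide's example.
print(gap_lengths("AT----CGCTTCAT"))         # [4]
print(row_gap_cost("AT----CGCTTCAT", 2, 1))  # 2 + 4 * 1 = 6
```

Under a plain indel cost the same gap would cost 4 * indel; the affine form charges the opening once, which favours fewer, longer gaps.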
Multiple Alignments
In general we also need to compare multiple sequences and find their similarities.
Multiple alignment generalizes the alignment idea to handle many sequences.
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
Sum-of-Pairs (SP) Score
Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the scores of the induced pairwise alignments of each pair of rows in the alignment.
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
AT-C-TCGAT     -TGCAT--AT     AT-C-TCGAT
-TGCAT--AT  +  ATCCA-CGCT  +  ATCCA-CGCT
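A minimal Python sketch of the SP score follows, assuming simple per-column costs (match 0, mismatch 1, indel 1) that are not given on the slides. Columns where both rows of a pair hold a space are skipped, since they do not belong to the induced pairwise alignment.

```python
from itertools import combinations


def pair_score(row1, row2, mismatch=1, indel=1):
    """Score the pairwise alignment induced by two rows of a multiple alignment."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue  # not a column of the induced pairwise alignment
        if a == "-" or b == "-":
            score += indel
        elif a != b:
            score += mismatch
    return score


def sp_score(rows, mismatch=1, indel=1):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(r1, r2, mismatch, indel)
               for r1, r2 in combinations(rows, 2))


rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))  # 5 + 4 + 6 = 15 with the assumed unit costs
```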
BAD NEWS
Multiple alignment is NP-hard.
One approach is to approximate the optimal value;
Progressive alignments.
A problem arises naturally: Aligning Alignments.
Aligning Alignments
Let S be a collection of strings s1, s2, s3, …, sk over an alphabet Σ;
An alignment of S is a matrix A with k rows such that:
i) Each entry is either a letter or a space;
ii) No column is all spaces;
iii) Reading across row i and removing the spaces gives string si.
As before, there are three types of column score: match, mismatch (substitution), and indel.
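The three conditions in the definition can be checked mechanically. The sketch below is an illustration only; the function name `is_alignment` and the use of '-' for a space are assumptions.

```python
def is_alignment(matrix, strings):
    """Check that `matrix` (a list of equal-length rows) is an alignment of `strings`."""
    if len(matrix) != len(strings):
        return False
    if len({len(row) for row in matrix}) > 1:
        return False  # rows of a matrix must have the same length
    # ii) no column may consist of spaces only
    if any(all(ch == "-" for ch in column) for column in zip(*matrix)):
        return False
    # i) and iii) removing the spaces from row i must give string s_i
    return all(row.replace("-", "") == s for row, s in zip(matrix, strings))


matrix = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
strings = ["ATCTCGAT", "TGCATAT", "ATCCACGCT"]
print(is_alignment(matrix, strings))            # True
print(is_alignment(["A-", "A-"], ["A", "A"]))   # False: the last column is all spaces
```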
Aligning Alignments
Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A and B.
A:
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT
B:
CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT
Aligning Alignments
In other words, we treat the columns of A and B as single letters, just as when aligning two sequences.
CTGT-T
AT-TGT
C-TG-T--T
-AT--T-GT
Aligning Alignments
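The score function is still sum-of-pairs, namely
$\sum_{1 \le i \le k} \sum_{1 \le j \le l} D(A'_i, B'_j)$
where Ai' and Bj' are row i of A and row j of B as they appear in the combined alignment.
We note that the induced alignment of Ai' and Bj' may contain columns with a space in both rows, so we simply remove those columns:
Ai': a----aa-a
Bj': aaa-a-a-a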
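The following Python sketch illustrates the double sum with gap costs: shared-space columns are removed from each pair of rows, then each maximal gap is charged open + x * extension. The function names and the cost values (mismatch 1, open 2, extension 1) are assumptions made for the example, and D is taken here to be this affine pairwise score.

```python
def strip_shared_spaces(row1, row2):
    """Drop the columns in which both rows hold a space."""
    kept = [(a, b) for a, b in zip(row1, row2) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def affine_pair_score(row1, row2, mismatch=1, gap_open=2, gap_ext=1):
    """Pairwise score D with affine gap costs, after removing shared spaces."""
    row1, row2 = strip_shared_spaces(row1, row2)
    score = 0
    prev = None  # which row held a space in the previous column (1, 2 or None)
    for a, b in zip(row1, row2):
        cur = 1 if a == "-" else 2 if b == "-" else None
        if cur is None:
            score += mismatch if a != b else 0
        else:
            score += gap_ext
            if cur != prev:
                score += gap_open  # a new gap starts in this column
        prev = cur
    return score


def cross_score(rows_a, rows_b, mismatch=1, gap_open=2, gap_ext=1):
    """Sum of D(A'_i, B'_j) over all rows i of A and rows j of B."""
    return sum(affine_pair_score(a, b, mismatch, gap_open, gap_ext)
               for a in rows_a for b in rows_b)


print(strip_shared_spaces("a----aa-a", "aaa-a-a-a"))  # ('a---aaa', 'aaaa-aa')
print(affine_pair_score("a----aa-a", "aaa-a-a-a"))    # (2 + 3) + (2 + 1) = 8
```

Note that removing the two shared-space columns merges what looked like separate gaps in Ai' into a single gap of size 3, which is exactly why gap counts depend on the other row.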
Aligning Alignments
Without gap cost, aligning alignments is solvable in polynomial time. We can apply dynamic programming as we did for aligning sequences; the only difference is that here we align columns.
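A minimal sketch of this column-wise dynamic programming in Python is given below. It assumes a per-column cost with no gap-open term (mismatch 1, letter against space 1, space against space 0); these values and the function names are illustrative. Only the cross pairs between A and B are scored, because the pairs inside A and inside B do not change when the columns are merged.

```python
def col_cost(col_a, col_b, mismatch=1, indel=1):
    """Sum-of-pairs cost of placing column col_a of A against column col_b of B."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue  # shared space costs nothing
            if x == "-" or y == "-":
                total += indel
            elif x != y:
                total += mismatch
    return total


def align_alignments(A, B, mismatch=1, indel=1):
    """Minimum cross sum-of-pairs score of aligning the columns of A and B."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    n, m = len(cols_a), len(cols_b)
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)  # all-space columns
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + col_cost(cols_a[i - 1], gap_b, mismatch, indel)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + col_cost(gap_a, cols_b[j - 1], mismatch, indel)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ca, cb = cols_a[i - 1], cols_b[j - 1]
            dp[i][j] = min(
                dp[i - 1][j - 1] + col_cost(ca, cb, mismatch, indel),  # align columns
                dp[i - 1][j] + col_cost(ca, gap_b, mismatch, indel),   # A column vs spaces
                dp[i][j - 1] + col_cost(gap_a, cb, mismatch, indel),   # B column vs spaces
            )
    return dp[n][m]


print(align_alignments(["AT", "A-"], ["AT"]))  # 1
```

The table has (n+1)(m+1) cells and each cell costs O(k l) to fill for k rows in A and l rows in B, so the running time is polynomial.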
Aligning Alignments
With gap cost, this problem is NP-complete.
We can use a reduction from the MAX-CUT problem.
MAX-CUT: Given a graph G = (V, E) and an integer c, ask whether there is a partition of V, V = L ∪ R with L ∩ R = ∅, such that the size of the cut is no less than c;
By the cut, we mean the set of edges that have one end vertex in L and the other in R.
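To make the MAX-CUT decision problem concrete, here is a brute-force Python sketch that tries every partition. It takes exponential time and is meant only to illustrate the definition; the function names are hypothetical.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one end vertex in `left`."""
    return sum((u in left) != (v in left) for u, v in edges)


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by trying all 2^|V| partitions (L, R)."""
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, in_left in zip(vertices, mask) if in_left}
        if cut_size(edges, left) >= c:
            return True
    return False


triangle = [(1, 2), (2, 3), (1, 3)]
print(cut_size(triangle, {1}))                   # 2
print(has_cut_of_size([1, 2, 3], triangle, 2))   # True
print(has_cut_of_size([1, 2, 3], triangle, 3))   # False
```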
NP-hardness
• Given an instance of MAX-CUT, G = (V, E) with V = {v1, v2, …, vn} and E = {e1, e2, …, em}, and an integer c;
• we construct two multiple alignments A and B over the alphabet {0,1}: both A and B have m edge rows and k dummy rows, each edge row corresponding to an edge; A has 2n columns, every two consecutive columns corresponding to a vertex; B has 3n columns, every three consecutive columns corresponding to a vertex;
NP-hardness
• The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n;
• For the edge rows in A: the row for edge e = (vi, vj) has the substring "-1" in the columns for vi and for vj, and spaces elsewhere;
• For the edge rows in B: the row for edge e = (vi, vj), i < j, has the substring "010" in the columns for vi and the substring "-10" in the columns for vj.
NP-hardness
• We simply let the score for a match be 0,
the score for a mismatch be 1,
the gap-open cost be 2 and the gap-extension cost be 1,
and ask whether there is an alignment whose score is less than d-c;
So we have an instance of Aligning Alignments.
HOMEWORK4
• Given a set of multiple alignments {A1, A2, …, An}, where each Ai is a multiple alignment with ki sequences, consider the problem of aligning these alignments without gap cost, using the method in this paper, i.e. aligning columns. Is this problem hard or easy? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.
Exact Algorithm
The basic idea is still dynamic programming; we have to remember extra information in a set S of so-called shapes: for each row in a multiple alignment, a shape records the column of the right-most letter.
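As a rough illustration of the shape idea, the Python sketch below follows only the one-line description above: a shape is stored as a tuple giving, for each row, the column of its right-most letter. The representation, the function names and the 1-based column numbering are assumptions; the paper's actual encoding of shapes may differ.

```python
def shape_of(rows):
    """For each row, the 1-based column of its right-most letter (0 if none)."""
    shape = []
    for row in rows:
        last = 0
        for col, ch in enumerate(row, start=1):
            if ch != "-":
                last = col
        shape.append(last)
    return tuple(shape)


def append_column(shape, column, col_index):
    """Shape after appending `column` as column number `col_index`."""
    return tuple(col_index if ch != "-" else last
                 for last, ch in zip(shape, column))


print(shape_of(["AT-C-", "-TG--"]))            # (4, 3)
print(append_column((4, 3), ("A", "-"), 6))    # (6, 3)
```

Knowing where each row's last letter sits is what lets the algorithm decide whether a newly appended column starts a new gap or extends an existing one in an induced pairwise alignment.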
Exact Algorithm
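$$S(i,j) = \begin{cases} \{\emptyset\} & i = 0 \text{ and } j = 0 \\ \{\} & i < 0 \text{ or } j < 0 \\ S(i-1,j-1) \oplus (A[i],B[j]) \;\cup\; S(i,j-1) \oplus (-,B[j]) \;\cup\; S(i-1,j) \oplus (A[i],-) & \text{otherwise} \end{cases}$$
Here $S \oplus c$ denotes the set of shapes obtained by appending the column pair $c$ to each shape in $S$.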
Exact Algorithm
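$$C(i,j,t) = \min \begin{cases} \min\limits_{s \in S(i-1,j-1),\; s \oplus (A[i],B[j]) = t} \Big\{ C(i-1,j-1,s) + open \cdot g(A[i],B[j],s) + \sum\limits_{p \in A,\, q \in B} D(A_p[i], B_q[j]) \Big\} \\ \min\limits_{s \in S(i,j-1),\; s \oplus (-,B[j]) = t} \Big\{ C(i,j-1,s) + open \cdot g(-,B[j],s) + extension \cdot k \cdot |B[j]| \Big\} \\ \min\limits_{s \in S(i-1,j),\; s \oplus (A[i],-) = t} \Big\{ C(i-1,j,s) + open \cdot g(A[i],-,s) + extension \cdot l \cdot |A[i]| \Big\} \end{cases}$$
where g(A[i], B[j], s) is the total number of gaps initiated by appending columns A[i] and B[j] onto an alignment that ends in shape s, and p and q range over the rows of A and B.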
Exact Algorithm
The optimum value is
$\min_{s \in S(m,n)} \{ C(m,n,s) \}$
The problem here is that the number of shapes may be very large, so in the worst case the time and space complexity are exponential in k.
[Closed-form worst-case bounds in n and k: not recoverable from the slide text.]