A Sub-quadratic Sequence Alignment Algorithm
Global alignment
[Figure: alignment graph for S = aacgacga, T = ctacgaga]
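Complexity: $O(n^2)$
$V(i,j) = \max \{\, V(i-1,j-1) + \delta(S[i], T[j]),\; V(i-1,j) + \delta(S[i], -),\; V(i,j-1) + \delta(-, T[j]) \,\}$
where $\delta(x, y)$ is the score of aligning x with y, and "-" denotes a gap.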
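The recurrence above can be sketched directly as a table fill. This is a minimal illustration only: the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for the example, since the slides leave the scoring function unrestricted.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the table V of the recurrence and return V(len(s), len(t)).

    The default scores are assumed values for illustration only.
    """
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # First column / first row: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align S[i] with T[j]
                V[i - 1][j] + gap,      # align S[i] with a gap
                V[i][j - 1] + gap,      # align T[j] with a gap
            )
    return V[n][m]


print(global_alignment_score("ag", "acg"))  # 1: two matches and one gap
```

Every one of the (n+1)(m+1) cells is filled in constant time, which is the quadratic cost the rest of the deck sets out to beat.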
FOUR RUSSIAN ALGORITHM
UNRESTRICTED SCORING FUNCTION
Main idea: Compress the sequences
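• S = aacgacga
• T = ctacgaga
LZ-78: Divide the sequence into distinct words
• S: a | ac | g | acg | a (words 1 to 4; the trailing "a" repeats word 1)
• T: c | t | a | cg | ag | a (words 1 to 5; the trailing "a" repeats word 3)
[Figure: the LZ-78 trie for S and the LZ-78 trie for T]
The number of distinct words: O(n / log n)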
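The parse above can be reproduced with a short sketch. The function name is hypothetical, and a Python set stands in for the trie shown on the slide; a real implementation would walk a trie so that each character costs constant time.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 phrases.

    Each phrase is the shortest prefix of the remaining input that has
    not been seen as a phrase before. A trailing piece that is already
    a known phrase is emitted as-is.
    """
    seen = set()   # stands in for the trie
    phrases = []
    current = ""
    for ch in seq:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)  # leftover, repeats an earlier phrase
    return phrases


print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
```

Each new phrase is an earlier phrase plus one character, which is the property the block computation relies on later in the deck.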
[Figure: the alignment graph divided into blocks, one block per pair of words, e.g. blocks 3/4, 3/2, 5/4, 5/2]
Main idea
[Figure: trie for T and trie for S]
• Compute the alignment score in each block
• Propagate the scores between the adjacent blocks
Main idea
• Compress the sequence into words
• Pre-compute the score for each block
• Do alignment between blocks
• Note:
– Replace normal characters by words
– Operate on blocks
COMPRESS THE SEQUENCE
LZ-78
• Theorem (Lempel and Ziv):
– Constant alphabet sequence S
– The maximal number of distinct phrases in S is O(n/log n).
• Tighter upper bound: O(hn/log n)
– h is the entropy factor
– a real number, 0 < h ≤ 1
– Entropy is small when the sequence is repetitive
COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK
Compute the alignment score in each block
• Given
– Input border: I
– Block
• Compute
– Output border: O
[Figure: a block with its input border I and output border O]
Matrices
• I[i]: the input border value
• DIST[i,j]: weight of the optimal path
– From entry i of the input border
– To entry j of its output border
• OUT[i,j]: merges the information from input row I and DIST
– OUT[i,j] = I[i] + DIST[i,j]
• O[j] = max{OUT[i,j] for i=1..n}
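The definitions above translate into a direct (non-optimized) computation of the output border. In this sketch the function name is hypothetical, `None` stands for an undefined DIST entry (△ on the slides), and the numbers are the I values and DIST matrix of the deck's example block for "acg" and "ag".

```python
NEG_INF = float("-inf")


def output_border(I, DIST):
    """O[j] = max over i of OUT[i][j], where OUT[i][j] = I[i] + DIST[i][j].

    Undefined DIST entries (None) are skipped, i.e. treated as -infinity.
    """
    O = []
    for j in range(len(DIST[0])):
        best = NEG_INF
        for i, row in enumerate(DIST):
            if row[j] is not None:
                best = max(best, I[i] + row[j])
        O.append(best)
    return O


U = None  # undefined entry
I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]
print(output_border(I, DIST))  # [1, 3, 3, 4, 2, 3]
```

This touches every entry of the matrix once, so it is the baseline that the later SMAWK slides improve on.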
DIST and OUT matrix example
[Figure: the block with its input border I and output border O]
Block – sub-sequences “acg”, “ag”

I (input borders): I0=1, I1=2, I2=3, I3=2, I4=1, I5=3

DIST matrix (△ = undefined entry)
       0    1    2    3    4    5
I0     0   -1   -2   -3    △    △
I1    -1   -1   -2   -1   -3    △
I2    -2    0    0    1   -1   -3
I3     △   -2   -2    0   -2   -2
I4     △    △   -2    0   -1   -1
I5     △    △    △   -2   -1    0

OUT matrix (OUT[i,j] = I[i] + DIST[i,j]; undefined entries are padded)
       0    1    2    3    4    5
I0     1    0   -1   -2   -∞   -∞
I1     1    1    0    1   -1   -∞
I2     1    3    3    4    2    0
I3   -12    0    0    2    0    0
I4   -13  -13   -1    1    0    0
I5   -14  -14  -14    1    2    3

O (maximum of each column of OUT)
O0 O1 O2 O3 O4 O5
 1  3  3  4  2  3
• For each block, given two sub-sequences S1, S2
• Compute (from scratch) DIST in O(n*m) time
• Given I and DIST, compute OUT in O(n*m) time
• Given OUT[i,j], compute O in O(m*n) time
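To make the meaning of DIST concrete, here is a brute-force sketch that builds it for one block by running a small dynamic program from every input-border entry. Several things are assumptions inferred from the example numbers rather than stated on the slides: the scoring (match +1, mismatch -1, gap -1), the border numbering (input border up the left column and then along the top row; output border along the bottom row and then up the right column), and all function names. It is also slower than the bound quoted above; it is meant only to show what each entry means.

```python
NEG_INF = float("-inf")  # plays the role of an undefined entry
GAP = -1                 # assumed gap score


def score(x, y):
    """Assumed substitution score: +1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def dist_matrix(vert, horiz):
    """DIST[i][j]: best path weight from input entry i to output entry j."""
    rows, cols = len(vert), len(horiz)
    # Input border: left column bottom-to-top, then top row left-to-right.
    inp = [(r, 0) for r in range(rows, -1, -1)]
    inp += [(0, c) for c in range(1, cols + 1)]
    # Output border: bottom row left-to-right, then right column upward.
    out = [(rows, c) for c in range(cols + 1)]
    out += [(r, cols) for r in range(rows - 1, -1, -1)]

    dist = []
    for r0, c0 in inp:
        best = [[NEG_INF] * (cols + 1) for _ in range(rows + 1)]
        best[r0][c0] = 0
        for r in range(r0, rows + 1):
            for c in range(c0, cols + 1):
                if (r, c) == (r0, c0):
                    continue
                cand = NEG_INF
                if r > r0:
                    cand = max(cand, best[r - 1][c] + GAP)
                if c > c0:
                    cand = max(cand, best[r][c - 1] + GAP)
                if r > r0 and c > c0:
                    diag = score(vert[r - 1], horiz[c - 1])
                    cand = max(cand, best[r - 1][c - 1] + diag)
                best[r][c] = cand
        dist.append([best[r][c] for r, c in out])
    return dist


D = dist_matrix("ag", "acg")
print(D[2])  # [-2, 0, 0, 1, -1, -3]
```

Under these assumptions the result agrees with the DIST matrix of the example block, with `-inf` where the slide shows △.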
Review
• Compress the sequence
• Pre-compute DIST[i,j] for each block
• Compute border values of each block
• Remaining questions
– How to compute DIST[i,j] efficiently?
– How to compute O[j] from I[i] and DIST[i,j] efficiently?
[Figure: the alignment graph divided into blocks]
COMPUTE O[J] EFFICIENTLY
Compute O[j] efficiently
• For each block of two sub-sequences S1, S2
• Given
– I[i]
– DIST[i,j]
• Compute
– O[j]
Compute O without explicit OUT
[Figure: the block for “acg” and “ag” again, now showing only the DIST matrix and the input borders I; the OUT matrix is not built]
I (input borders): I0=1, I1=2, I2=3, I3=2, I4=1, I5=3
O0 O1 O2 O3 O4 O5
 1  3  3  4  2  3
SMAWK
• Given DIST[i,j], I[i] we can compute O[j] in O(n+m)
– Without creating OUT[i,j]
• How? Why?
Why?
• Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
• Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
1. Convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all a<b and c<d.
2. Concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all a<b and c<d.
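The concave condition can be checked mechanically on a small matrix. The sketch below is a brute-force test over all index pairs, intended only to illustrate the definition; the function name is hypothetical, and the sample matrix is the padded OUT matrix of the deck's example block.

```python
from itertools import combinations

INF = float("inf")


def is_concave_totally_monotone(M):
    """True if M[a][c] <= M[b][c] implies M[a][d] <= M[b][d]
    for every a < b and c < d."""
    n_rows, n_cols = len(M), len(M[0])
    for a, b in combinations(range(n_rows), 2):      # a < b
        for c, d in combinations(range(n_cols), 2):  # c < d
            if M[a][c] <= M[b][c] and not (M[a][d] <= M[b][d]):
                return False
    return True


OUT = [
    [1, 0, -1, -2, -INF, -INF],
    [1, 1, 0, 1, -1, -INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]
print(is_concave_totally_monotone(OUT))               # True
print(is_concave_totally_monotone([[0, 1], [1, 0]]))  # False
```

The second matrix fails because row 1 wins in column 0 but loses in column 1, which is exactly the crossing the condition forbids.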
How?
• Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
• Why is DIST[i,j] totally monotone?
[Figure: a block with its input border I and output border O]
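SMAWK itself is fairly intricate, so the sketch below is not SMAWK. It is a simpler divide-and-conquer that uses the same concave total-monotonicity property: the row holding a column's maximum never moves upward as the column index grows, so each half of the columns only needs a sub-range of rows. It queries more than a linear number of entries (an extra logarithmic factor), and the names `column_maxima` and `entry` are hypothetical. The matrix is accessed through a callable, so entries can be produced on demand instead of being stored.

```python
def column_maxima(entry, n_rows, n_cols):
    """Column maxima of an implicit matrix that satisfies the concave
    condition; entry(i, j) returns the value at row i, column j."""
    result = [None] * n_cols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        best_row = row_lo
        for r in range(row_lo, row_hi + 1):
            # ">=" keeps the bottommost maximum, whose row index is
            # non-decreasing from left to right under the condition.
            if entry(r, mid) >= entry(best_row, mid):
                best_row = r
        result[mid] = entry(best_row, mid)
        solve(col_lo, mid - 1, row_lo, best_row)  # columns to the left
        solve(mid + 1, col_hi, best_row, row_hi)  # columns to the right

    solve(0, n_cols - 1, 0, n_rows - 1)
    return result


INF = float("inf")
# Padded OUT matrix of the example block, used here as test data.
OUT = [
    [1, 0, -1, -2, -INF, -INF],
    [1, 1, 0, 1, -1, -INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]
print(column_maxima(lambda i, j: OUT[i][j], 6, 6))  # [1, 3, 3, 4, 2, 3]
```

In a real block the callable would return I[i] + DIST[i][j] (with the padding for undefined entries), so OUT never has to be written out.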
The concave condition
If b-c is better than a-c, then b-d is better than a-d.
[Figure: input-border entries a and b, output-border entries c and d]
Other problem
• Rectangle problem of DIST
• Set upper right corner of OUT to -∞
• Set lower left corner of OUT to -(n+i-1)*k
• Preserve the totally monotone property of OUT

DIST matrix (△ = undefined entry)
       0    1    2    3    4    5
I0     0   -1   -2   -3    △    △
I1    -1   -1   -2   -1   -3    △
I2    -2    0    0    1   -1   -3
I3     △   -2   -2    0   -2   -2
I4     △    △   -2    0   -1   -1
I5     △    △    △   -2   -1    0
COMPUTE DIST[I,J] EFFICIENTLY
Compute DIST[i,j] for block(5/4)
[Figure: the trie for T and the trie for S, followed by several frames of the block with input border entries I0 to I5, the output entry O3 marked, and the block's DIST matrix. The repeated frames all show the matrix below, with rows listed from I5 down to I0.]
DIST matrix
 0   -1   -2    Δ    Δ    Δ    I5 = 3
-1   -1    0   -2    Δ    Δ    I4 = 1
-2   -2    0   -2   -2    Δ    I3 = 2
-3   -1    1    0    0   -2    I2 = 3
 Δ   -2   -1   -2   -1   -1    I1 = 2
 Δ    Δ   -3   -2   -1    0    I0 = 1
• Only column m in DIST[i,j] is new
• DIST block can be updated in O(m+n)
MAINTAINING DIRECT ACCESS TO DIST TABLE
[Figure: the alignment graph for S = aacgacga and T = ctacgaga together with the trie for T and the trie for S; later frames add the stored columns of DIST values]
Complexity
• Assume |S| = |T| = n
• Number of words in S, T = O(hn/log n)
• Number of blocks in alignment graph: O(h^2 n^2/(log n)^2)
• For each block
– Update new DIST block: O(t), where t = size of the border
– Create direct access table: O(t)
• Propagating I/O across blocks
– SMAWK: O(t)
• Sum of the sizes of all borders is O(hn^2/log n)
• Total complexity: O(hn^2/log n)
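The bound on the total border size follows from a short count. The notation here is introduced only for this sketch: $s_1,\dots,s_p$ are the words of S, $t_1,\dots,t_q$ are the words of T, and the border of the block for $(s_i, t_j)$ is taken to have size proportional to $|s_i| + |t_j|$.

```latex
\sum_{i=1}^{p} \sum_{j=1}^{q} \bigl( |s_i| + |t_j| \bigr)
  = q \sum_{i=1}^{p} |s_i| + p \sum_{j=1}^{q} |t_j|
  = (p + q)\, n
  = O\!\left( \frac{h n^2}{\log n} \right),
\qquad \text{since } p, q = O\!\left( \frac{h n}{\log n} \right).
```

Because every per-block step costs O(t) in the border size t, summing over all blocks gives the same bound for the total running time.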
Other extensions
• Trace
• Reducing the space complexity for discrete scoring
• Local alignment
References
• Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688
• Some pictures from 葉恆青