1
Approximate String Matching Using Compressed Suffix Arrays
Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249
Advisor: Prof. R. C. T. Lee
Speaker: C. W. Lu
2
• Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x into y.
• k-difference string matching problem:
  – Given a text T of length n, a pattern P of length m, and an error bound k.
  – Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
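The two definitions above can be checked directly with dynamic programming. The following is a minimal sketch, not the method of the paper: `edit_distance` is the classic table-filling computation of d(x, y), and `k_difference_end_positions` (a hypothetical helper name) uses the well-known column-by-column variant in which a match may start anywhere in T. It serves only as a reference against which an index-based solution can be compared.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # delete cx
                            curr[j - 1] + 1,             # insert cy
                            prev[j - 1] + (cx != cy)))   # replace, or match for free
        prev = curr
    return prev[-1]


def k_difference_end_positions(text: str, pattern: str, k: int) -> list[int]:
    """1-based positions i such that some suffix S of text[:i] has d(S, pattern) <= k."""
    m = len(pattern)
    col = list(range(m + 1))                # col[j]: best distance to pattern[:j]
    ends = []
    for i, c in enumerate(text, start=1):
        new = [0]                           # a match may start at any text position
        for j in range(1, m + 1):
            new.append(min(col[j] + 1,
                           new[j - 1] + 1,
                           col[j - 1] + (pattern[j - 1] != c)))
        col = new
        if col[m] <= k:
            ends.append(i)
    return ends
```

This scan takes O(nm) time per query, which is what the indexing approach of the following slides tries to avoid for a fixed text T.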
3
• The approach of this paper is as follows:
• Given a pattern P and an error bound k, we generate all edited patterns P’ that can be obtained from P with at most k errors.
• Then we conduct an exact match of every such P’ against T.
4
• Example:
T=abbaaa,
P=aba and k=1.
From P and k, we generate the following P’s:
ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abb, abab, aba.
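The enumeration in this example can be reproduced with a short brute-force sketch. The function names `one_edit` and `edited_patterns` are hypothetical, and the alphabet {a, b} is an assumption taken from the example; this illustrates the idea only and is not the generation order used later in the talk.

```python
def one_edit(p: str, alphabet: str) -> set[str]:
    """All strings obtained from p by at most one insertion, deletion or replacement."""
    out = set()
    for i in range(len(p) + 1):
        for c in alphabet:
            out.add(p[:i] + c + p[i:])              # insert c before position i
        if i < len(p):
            out.add(p[:i] + p[i + 1:])              # delete the character at i
            for c in alphabet:
                out.add(p[:i] + c + p[i + 1:])      # replace it (c == p[i] gives p itself)
    return out


def edited_patterns(p: str, k: int, alphabet: str) -> set[str]:
    """All strings within k edits of p, built one edit at a time."""
    seen = {p}
    frontier = {p}
    for _ in range(k):
        frontier = {q for s in frontier for q in one_edit(s, alphabet)} - seen
        seen |= frontier
    return seen


if __name__ == "__main__":
    text = "abbaaa"
    for q in sorted(edited_patterns("aba", 1, "ab")):
        print(q, "occurs in T" if q in text else "does not occur in T")
```

The number of edited patterns grows quickly with k, which is why each exact match has to be cheap.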
5
• Then we conduct an exact matching of all P’s against T. Any success indicates that there is a substring S in T such that d(S, P) ≦ k.
• How can we generate all P’s which we want?
• We use the following observation.
6
Let S be a substring of T with S = S1S2, and let P = P1P2.
If d(S1, P1) ≦ k and d(S2, P2) = 0, then d(S, P) ≦ k.
[Figure: S = S1S2 inside T, aligned with P = P1P2.]
7
Example:
position  1  2  3  4  5  6  7  8  9 10 11 12 13
T         A  C  A  C  A  A  A  A  A  C  A  C  C
position  1  2  3  4  5  6
P         A  G  A  B  C  A
k = 2
Consider the substring S = T(6, 11) = AAAACA.
Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA, with P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.
Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.
We have Dist(S, P) = 2 ≦ k.
8
Example:
position  1  2  3  4  5  6  7  8  9 10 11 12 13
T         A  C  A  C  A  A  A  A  A  C  A  C  C
position  1  2  3  4  5  6
P         A  G  A  B  C  A
k = 2
Consider the substring S = T(8, 11) = AACA.
Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA, with P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.
Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.
We have Dist(S, P) = 2 ≦ k.
9
• Based upon the above observation, we can generate all edited patterns P’ by editing a prefix of P and keeping the remaining suffix untouched.
• Consider P=aba, k=1.
10
• P=aba, k=1.
P = aba
i = 1: ba (Deletion) k = 1
       aaba (Insertion) k = 1
       baba (Insertion) k = 1
       bba (Substitution) k = 1
       aba k = 0
i = 2: aa (Deletion) k = 1
       aaba (Insertion) k = 1
       abba (Insertion) k = 1
       aaa (Substitution) k = 1
       aba k = 0
i = 3: ab (Deletion) k = 1
       abaa (Insertion) k = 1
       abba (Insertion) k = 1
       abb (Substitution) k = 1
       aba k = 0
i = 4: abaa (Insertion) k = 1
       abab (Insertion) k = 1
11
• P=aba, k=2.
The first level of edited patterns is the same as for P=aba, k=1 on the previous slide. Each pattern with k = 1 is then edited once more at the positions after its first edit. The pattern ba (Deletion at i = 1, k = 1) is expanded on the next slide.
12
• P=aba, k=2.
ba (k = 1)
i = 2: a (Deletion) k = 2
       aba (Insertion) k = 2
       bba (Insertion) k = 2
       aa (Substitution) k = 2
       ba k = 1
i = 3: b (Deletion) k = 2
       baa (Insertion) k = 2
       bba (Insertion) k = 2
       bb (Substitution) k = 2
       ba k = 1
i = 4: baa (Insertion) k = 2
       bab (Insertion) k = 2
13
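For i = 1 to m+1:
P = PL PR, where PR starts at position i, and P’ = PL’ PR’.
k’ = Dist(PL’, PL) ≦ k.
Dist(PR’, PR) = 0.
At position i, one of the following is applied:
– Deletion, k’++
– Replacement, k’++
– Insertion, k’++
– No operation.
Terminate if k’ > k.
[Figure: P’ = PL’ PR’ under each of the four cases at position i.]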
14
• Our problem now becomes the following: given an edited pattern P’ produced from P, determine whether P’ exactly matches some substring of T.
• For example, suppose P=aba. One of the edited patterns is ba, so we would like to find out whether ba exactly matches a substring of T.
15
• This exact matching can be found by using the suffix array and the inverse suffix array.
16
Suffix Array
• Let T = t0 t1 … tn-1 tn, where t0, t1, …, tn-1 are symbols from an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A.
• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.
• The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix Tj, sorted in lexicographic order of the corresponding suffixes.
17
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G T T C G $
Suffixes of T:
{GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}
Lexicographic order:
$, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$
= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5
i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
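The table above can be reproduced with a deliberately naive sketch that sorts the suffixes directly. The function name `suffix_array` is illustrative. The sketch relies on "$" comparing below the letters in Python's default string ordering, which matches the convention that $ is smaller than every symbol of A. It is not one of the linear-time constructions; sorting whole suffixes costs far more than O(n).

```python
def suffix_array(t: str) -> list[int]:
    """Start positions of the suffixes of t, in lexicographic order of the suffixes.

    t is expected to end with the sentinel "$".
    """
    return sorted(range(len(t)), key=lambda j: t[j:])


if __name__ == "__main__":
    t = "GACAGTTCG$"
    for i, j in enumerate(suffix_array(t)):
        print(i, j, t[j:])
```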
18
Inverse Suffix Array
• The inverse suffix array of T is denoted as SA-1[i].
• SA-1[i] equals the number of suffixes that are lexicographically smaller than Ti.
19
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G T T C G $

i   SA[i]  TSA[i]        SA-1[i]
0   9      $             6
1   1      ACAGTTCG$     1
2   3      AGTTCG$       3
3   2      CAGTTCG$      2
4   7      CG$           7
5   8      G$            9
6   0      GACAGTTCG$    8
7   4      GTTCG$        4
8   6      TCG$          5
9   5      TTCG$         0

SA-1[SA[x]] = x.
SA-1[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$.
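Since SA-1[SA[x]] = x, the inverse array can be filled in one pass over SA. A minimal sketch, with the illustrative name `inverse_suffix_array`:

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    isa = [0] * len(sa)
    for rank, j in enumerate(sa):
        isa[j] = rank
    return isa


if __name__ == "__main__":
    sa = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]     # suffix array of GACAGTTCG$
    print(inverse_suffix_array(sa))
```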
20
• SA and SA-1 each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].
21
• In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed].
We write [st..ed ] = range(T, P).
22
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G T T C G $

i   SA[i]  TSA[i]
0   9      $
1   1      ACAGTTCG$
2   3      AGTTCG$
3   2      CAGTTCG$
4   7      CG$
5   8      G$
6   0      GACAGTTCG$
7   4      GTTCG$
8   6      TCG$
9   5      TTCG$

P = G.
G is a prefix of T8, T0 and T4.
T8 = TSA[5], T0 = TSA[6], T4 = TSA[7].
st = 5, ed = 7, so range(T, P) = [5..7].
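For illustration, range(T, P) can be found from scratch with two binary searches over the suffix array, comparing P against the first |P| characters of each suffix. This is a sketch under the assumption that T is held as a Python string ending in "$". The name `sa_range` and the convention of returning None for an empty interval are choices made here, not taken from the paper.

```python
def sa_range(t: str, sa: list[int], p: str):
    """Return (st, ed) = range(T, P), or None if P does not occur in T."""
    m = len(p)

    def head(i: int) -> str:
        return t[sa[i]:sa[i] + m]           # first |P| characters of the i-th suffix

    lo, hi = 0, len(sa)
    while lo < hi:                          # first rank whose head is >= P
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = len(sa)
    while lo < hi:                          # first rank whose head is > P
        mid = (lo + hi) // 2
        if head(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None
```

Each comparison costs up to |P| character comparisons, so this direct search takes O(|P| log n) time. The lemmas on the following slides avoid restarting the search for every edited pattern.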
23
Lemma 1 (Gusfield [12])
Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.
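One way to see why O(log n) suffices: inside [st..ed] every suffix starts with P, so the suffixes are ordered by the character that follows P, and a binary search on that single character finds the sub-interval. A minimal sketch, assuming the caller keeps track of |P| (the parameter `plen`); the function name `extend_right` is hypothetical.

```python
def extend_right(t: str, sa: list[int], st: int, ed: int, plen: int, c: str):
    """Given [st..ed] = range(T, P) with |P| = plen, return range(T, Pc) or None."""

    def next_char(i: int) -> str:
        pos = sa[i] + plen                  # position just after the occurrence of P
        return t[pos] if pos < len(t) else ""

    lo, hi = st, ed + 1
    while lo < hi:                          # first rank whose next character is >= c
        mid = (lo + hi) // 2
        if next_char(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    new_st = lo
    hi = ed + 1
    while lo < hi:                          # first rank whose next character is > c
        mid = (lo + hi) // 2
        if next_char(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    new_ed = lo - 1
    return (new_st, new_ed) if new_st <= new_ed else None
```

With T = GACAGTTCG$ and range(T, G) = [5..7], extending by A narrows the interval to [6..6], the rank of GACAGTTCG$.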
24
Lemma 2
Given the interval [st1..ed1] = range(T , P1) and the interval [st2..ed2] = range(T , P2), we can find the interval [st..ed] = range(T , P1P2) in O(logn) time using the suffix array and the inverse suffix array of T.
25
Let [st1..ed1] = range(T , P1),
[st2..ed2] = range(T , P2),
[st..ed] = range(T , P1P2).
[st..ed] is a subinterval of [st1..ed1].
26
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G T T C G $

i   SA[i]  TSA[i]
0   9      $
1   1      ACAGTTCG$
2   3      AGTTCG$
3   2      CAGTTCG$
4   7      CG$
5   8      G$
6   0      GACAGTTCG$
7   4      GTTCG$
8   6      TCG$
9   5      TTCG$

P1 = G. P2 = A.
range(T, P1) = [5..7].
range(T, P1P2) must be within [5..7].
How can we find the exact interval within [5..7]?
27
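• By the definition of the suffix array, the suffixes T_{SA[st1]}, T_{SA[st1+1]}, ..., T_{SA[ed1]} are in increasing lexicographic order.
• The suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, ..., T_{SA[ed1]+|P1|} are also in increasing lexicographic order, because the former all begin with P1 and deleting this common prefix does not change their relative order.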
28
Lexicographic order:
$ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
T2 = CAGTTCG$
T2+1 = T3 = AGTTCG$
T2+1 is obtained by deleting the prefix of length 1 from T2.
In general, Ti+1 can be obtained by deleting the prefix of length 1 from Ti.
29
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G T T C G $

i   SA[i]  TSA[i]
0   9      $
1   1      ACAGTTCG$
2   3      AGTTCG$
3   2      CAGTTCG$
4   7      CG$
5   8      G$
6   0      GACAGTTCG$
7   4      GTTCG$
8   6      TCG$
9   5      TTCG$

P1 = G. P2 = A.
range(T, P1) = [5..7].
T_{SA[st1]}, T_{SA[st1+1]}, ..., T_{SA[ed1]}:
T8 < T0 < T4
T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, ..., T_{SA[ed1]+|P1|}:
T8+1, T0+1, T4+1, that is, T9 < T1 < T5
30
• The suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, ..., T_{SA[ed1]+|P1|} are also in increasing lexicographic order.
• Thus SA-1[SA[st1]+|P1|] < SA-1[SA[st1+1]+|P1|] < ... < SA-1[SA[ed1]+|P1|].
• To find st and ed, we find the smallest st in [st1..ed1] such that st2 ≦ SA-1[SA[st]+|P1|] ≦ ed2 and the largest ed in [st1..ed1] such that st2 ≦ SA-1[SA[ed]+|P1|] ≦ ed2.
31
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G A T C G $

i   SA[i]  TSA[i]        SA-1[i]
0   9      $             7
1   1      ACAGATCG$     1
2   3      AGATCG$       4
3   5      ATCG$         2
4   2      CAGATCG$      8
5   7      CG$           3
6   8      G$            9
7   0      GACAGATCG$    5
8   4      GATCG$        6
9   6      TCG$          0

P1 = G. P2 = A.
range(T, P1) = [6..8], so 6 ≦ st, ed ≦ 8.
range(T, P2) = [1..3].
range(T, P1P2) = [st..ed].
SA-1[SA[6]+1] = SA-1[9] = 0 < 1, so rank 6 is excluded.
SA-1[SA[7]+1] = SA-1[1] = 1, and 1 ≦ 1 ≦ 3.
SA-1[SA[8]+1] = SA-1[5] = 3, and 1 ≦ 3 ≦ 3.
st = 7 and ed = 8.
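The search in this example can be sketched as two binary searches on the increasing values SA-1[SA[i]+|P1|] for i in [st1..ed1]. The function name `concat_range` and the use of None for an empty result are choices made here for illustration. The sketch assumes, as in the slides, that P1 contains no "$", so SA[i]+|P1| is always a valid position.

```python
def concat_range(sa: list[int], isa: list[int], r1, len1: int, r2):
    """From r1 = range(T, P1), |P1| = len1 and r2 = range(T, P2), find range(T, P1P2)."""
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(i: int) -> int:
        return isa[sa[i] + len1]            # SA-1[SA[i] + |P1|], increasing in i

    lo, hi = st1, ed1 + 1
    while lo < hi:                          # smallest st with rank_after(st) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                          # first rank with rank_after(...) > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


if __name__ == "__main__":
    sa = [9, 1, 3, 5, 2, 7, 8, 0, 4, 6]     # T = GACAGATCG$
    isa = [7, 1, 4, 2, 8, 3, 9, 5, 6, 0]
    print(concat_range(sa, isa, (6, 8), 1, (1, 3)))   # P1 = G, P2 = A
```

Both searches inspect O(log n) ranks and each inspection is two array lookups, which is where the O(log n) bound of Lemma 2 comes from.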
32
• To find the interval of the first character of P:
We construct an array C such that for any c in A, C[c] stores the total number of occurrences in T of all characters c’ in A with c’ ≦ c.
range(T, p1) = [C[c2]+1 … C[c]], where c = p1 is the first character of P and c2 is the character immediately before c in A (if c is the smallest character of A, the interval starts at 1).
33
Example:
i  0 1 2 3 4 5 6 7 8 9
T  G A C A G T T C G $

i   SA[i]  TSA[i]
0   9      $
1   1      ACAGTTCG$
2   3      AGTTCG$
3   2      CAGTTCG$
4   7      CG$
5   8      G$
6   0      GACAGTTCG$
7   4      GTTCG$
8   6      TCG$
9   5      TTCG$

P = GACAGCA
C[A] = 2
C[C] = 4
C[G] = 7
C[T] = 9
range(T, p1) = [C[C]+1…C[G]] = [5…7].
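The array C and the first-character interval can be sketched in a few lines. The names `build_C` and `first_char_range` are hypothetical, and the alphabet is passed explicitly as an assumption. The "+1" works because the sentinel $ always occupies index 0 of the suffix array and is not counted in C.

```python
from collections import Counter


def build_C(t: str, alphabet: str) -> dict[str, int]:
    """C[c] = number of occurrences in t of alphabet characters c' <= c ("$" excluded)."""
    counts = Counter(t)
    C, total = {}, 0
    for c in sorted(alphabet):
        total += counts[c]
        C[c] = total
    return C


def first_char_range(C: dict[str, int], alphabet: str, c: str):
    """range(T, c) for a single character c, or None if c does not occur in T."""
    order = sorted(alphabet)
    idx = order.index(c)
    st = (C[order[idx - 1]] if idx > 0 else 0) + 1
    ed = C[c]
    return (st, ed) if st <= ed else None


if __name__ == "__main__":
    C = build_C("GACAGTTCG$", "ACGT")
    print(C, first_char_range(C, "ACGT", "G"))
```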
34
• Lemma 3
Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P) and that the array C has been computed in advance. Then, for any character c, we can find the interval [st’..ed’] = range(T, cP) in O(log n) time.
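One way to read Lemma 3 is as Lemma 2 with P1 = c: range(T, c) comes from the array C, and the binary search then runs on SA-1[SA[i]+1]. The sketch below follows that reading, which is an interpretation rather than a statement from the slides, and uses it to compute the interval of every suffix P[i..m] of a pattern from right to left. The names `prepend_char` and `suffix_ranges` are hypothetical, positions are 0-based, and None marks an empty interval.

```python
def prepend_char(sa, isa, c_range, p_range):
    """range(T, cP) from c_range = range(T, c) and p_range = range(T, P)."""
    if c_range is None or p_range is None:
        return None
    (st1, ed1), (st2, ed2) = c_range, p_range

    def key(i):
        return isa[sa[i] + 1]               # rank of the suffix after dropping c

    lo, hi = st1, ed1 + 1
    while lo < hi:                          # smallest rank with key >= st2
        mid = (lo + hi) // 2
        if key(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                          # first rank with key > ed2
        mid = (lo + hi) // 2
        if key(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st <= lo - 1 else None


def suffix_ranges(pattern, sa, isa, C, alphabet):
    """F[i] = range(T, pattern[i:]) for i = 0..m; F[m] is the range of the empty string."""
    order = sorted(alphabet)

    def char_range(c):
        idx = order.index(c)
        st = (C[order[idx - 1]] if idx > 0 else 0) + 1
        return (st, C[c]) if st <= C[c] else None

    m = len(pattern)
    F = [None] * (m + 1)
    F[m] = (0, len(sa) - 1)
    for i in range(m - 1, -1, -1):
        F[i] = prepend_char(sa, isa, char_range(pattern[i]), F[i + 1])
    return F
```

For T = GACAGTTCG$ and the pattern CAG this gives the intervals of G, AG and CAG in turn, each obtained from the previous one with a single O(log n) step.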
35
I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
II. Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
  2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
  3. If (k’ = k) return.
  4. For j := i to m+1
     (a) (when j ≦ m, deletion at j)
         Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
     (b) (when j ≦ m, replacement at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
     (c) (insertion at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
     (d) (when j ≦ m) Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]).
         s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
end
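The procedure can be sketched end to end in Python. This is an illustrative reading of the pseudocode under several assumptions, not the authors' implementation: the suffix array is built by naive sorting, positions are 0-based, the edit trace Υ is dropped, k < m is assumed, and matches are returned as 1-based end positions. The names `narrow` and `k_difference` are hypothetical. `narrow` plays the role of Lemma 1 when the key is the next character and of Lemma 2 when the key is SA-1[SA[r]+|PL’|].

```python
def narrow(lo, hi, key, lo_val, hi_val):
    """Largest sub-interval of [lo..hi] whose non-decreasing keys lie in [lo_val..hi_val]."""
    a, b = lo, hi + 1
    while a < b:
        mid = (a + b) // 2
        if key(mid) < lo_val:
            a = mid + 1
        else:
            b = mid
    st, b = a, hi + 1
    while a < b:
        mid = (a + b) // 2
        if key(mid) <= hi_val:
            a = mid + 1
        else:
            b = mid
    return (st, a - 1) if st <= a - 1 else None


def k_difference(text, pattern, k, alphabet):
    """1-based end positions of substrings of text within edit distance k of pattern."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda j: t[j:])      # naive construction
    isa = [0] * len(sa)
    for rank, j in enumerate(sa):
        isa[j] = rank
    m, full = len(pattern), (0, len(sa) - 1)
    # Step I: F[i] = range(T, pattern[i:]), computed right to left.
    F = [None] * (m + 1)
    F[m] = full
    for i in range(m - 1, -1, -1):
        c_rng = narrow(*full, lambda r: t[sa[r]], pattern[i], pattern[i])
        if c_rng is not None and F[i + 1] is not None:
            F[i] = narrow(*c_rng, lambda r: isa[sa[r] + 1], *F[i + 1])
    ends = set()

    def rec(rng, plen, i, used):
        # rng = range(T, PL'), plen = |PL'|, i = first untouched pattern position
        if F[i] is not None:
            hit = narrow(*rng, lambda r: isa[sa[r] + plen], *F[i])
            if hit is not None:
                for r in range(hit[0], hit[1] + 1):
                    ends.add(sa[r] + plen + m - i)
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                rec(rng, plen, j + 1, used + 1)              # deletion at j
            for c in alphabet:
                ext = narrow(*rng, lambda r: t[sa[r] + plen], c, c)
                if ext is None:
                    continue
                if j < m:
                    rec(ext, plen + 1, j + 1, used + 1)      # replacement at j
                rec(ext, plen + 1, j, used + 1)              # insertion at j
            if j < m:                                        # keep pattern[j] unchanged
                rng = narrow(*rng, lambda r: t[sa[r] + plen], pattern[j], pattern[j])
                if rng is None:
                    return
                plen += 1

    rec(full, 0, 0, 0)
    return sorted(ends)
```

On the earlier example, `k_difference("abbaaa", "aba", 1, "ab")` explores edited patterns such as ba, bba and abb and abandons a branch as soon as its prefix PL’ no longer occurs in T.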
36
• After O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.
37
References
• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.
• [2] A. Amir, M. Lewenstein, Ely. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.
• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.
• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.
• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.
• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.
• [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.
• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.
• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.
38
• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.
• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
• [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.
• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.
• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.
• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.
• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.
• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.
• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.
39
• [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.
• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.
• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.
• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.
• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.
• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.
• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.
• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.
• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.
• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.
40
Thank you!