Reverse Factor Algorithm
Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen
Speeding up two string-matching algorithms, Algorithmica, Vol. 12, 1994, pp. 247-267
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W.
Rule 1: The Suffix to Prefix Rule
• For a window to have any chance of matching the pattern, some suffix of the window must be equal to a prefix of the pattern.
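The rule can be checked directly on small inputs. The following is a minimal brute-force sketch, not the method used by the algorithm itself; the function name `longest_suffix_prefix` is hypothetical, and the sample windows are chosen only for illustration.

```python
def longest_suffix_prefix(window: str, pattern: str) -> int:
    """Length of the longest suffix of `window` that is also a prefix of `pattern`."""
    # Try the longest candidate first so the first hit is the answer.
    for k in range(min(len(window), len(pattern)), 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0  # no suffix of the window is a prefix of the pattern


pattern = "GCAGAGAG"
print(longest_suffix_prefix("GCATCGCA", pattern))  # 3, the suffix "GCA"
print(longest_suffix_prefix("GCAGAGAG", pattern))  # 8, the window equals the pattern
print(longest_suffix_prefix("TATATATA", pattern))  # 0, the rule rules this window out
```

A result of 0 means that no occurrence of the pattern can overlap the end of this window.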
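Basic Ideas
• Open a window W of size |P| in the text T.
[Figure: text T, a window W of length |P|, and the pattern P]
• Find the longest suffix of W that is also a prefix of the pattern.
Case 1: [Figure: T, W and P, with the matching region labelled "Match!"]
Case 2: [Figure: T, W and P, before and after the window is moved]
Case 3: If there is no such suffix, we move W by a length of |P|.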
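The window movement described above can be sketched as a small function. This is an illustrative sketch under assumptions: the name `shift_after_attempt` is hypothetical, the window is assumed to have exactly |P| characters, and only proper suffixes are considered so that the window always advances, even after a full match.

```python
def shift_after_attempt(window: str, pattern: str) -> int:
    """Distance to move the window, given its current contents."""
    m = len(pattern)
    # Lengths k < m for which the last k characters of the window
    # form a prefix of the pattern (assumption: len(window) == m).
    overlaps = [k for k in range(1, m) if pattern.startswith(window[m - k:])]
    # With an overlap of length k the window moves by m - k;
    # with no overlap at all it moves by the full length m.
    return m - max(overlaps, default=0)


pattern = "GCAGAGAG"
print(shift_after_attempt("GCATCGCA", pattern))  # 5: suffix "GCA" is a prefix
print(shift_after_attempt("TATATATA", pattern))  # 8: no such suffix
```

Moving by |P| minus the overlap lines the pattern's prefix up with the matching suffix of the old window, so no occurrence in between is skipped.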
Preprocessing phase
• T=GCATCGCAGAGAGTATACAGTACG
• P=GCAGAGAG
• L(S): the set containing all prefixes of the pattern.
L(S) = {GCAGAGAG, GCAGAGA, GCAGAG, GCAGA, GCAG, GCA, GC, G}
We construct the suffix automaton of P.
[Figure: Suffix Automaton, states 0 to 8]
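A suffix automaton can be built online, one character at a time. The sketch below uses the standard suffix-link construction and assumes the automaton is built over the reversed pattern, so that reading a prefix of P from right to left ends in a terminal state; the names `build_suffix_automaton` and `classify` are hypothetical, and the state numbering will not necessarily match the figure.

```python
def build_suffix_automaton(s):
    """Return (transitions, terminal_states) of the suffix automaton of s."""
    trans, link, length = [{}], [-1], [0]  # state 0 is the initial state
    last = 0
    for ch in s:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q: the clone keeps the shorter strings.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    # Terminal states are found by following suffix links from the last state.
    terminal = set()
    p = last
    while p != -1:
        terminal.add(p)
        p = link[p]
    return trans, terminal


def classify(word, trans, terminal):
    """Say whether `word` is a suffix, a factor, or not a factor of s."""
    state = 0
    for ch in word:
        if ch not in trans[state]:
            return "not a factor"
        state = trans[state][ch]
    return "suffix" if state in terminal else "factor"


pattern = "GCAGAGAG"
trans, terminal = build_suffix_automaton(pattern[::-1])
for k in range(1, len(pattern) + 1):
    prefix = pattern[:k]
    print(prefix, classify(prefix[::-1], trans, terminal))  # every line says "suffix"
print(classify("GAC", trans, terminal))   # factor
print(classify("ACGC", trans, terminal))  # not a factor
```

Every element of L(S), read backwards, is accepted as a suffix of the reversed pattern, which is the property the search phase relies on.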
Preprocessing: Construct a Suffix Tree
Example:
P: GCAGAGAG
PR: GAGAGACG
PR: the reversal of the string P.
Suffixes of PR: GAGAGACG, AGAGACG, GAGACG, AGACG, GACG, ACG, CG, G
[Figure: suffix tree of PR, with nodes numbered 0 to 12 and leaves labelled 1 to 8]
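The suffix list above can be reproduced in a few lines. This is a small sketch; pairing each suffix with its 1-based start position is an assumption about how the figure labels its leaves.

```python
pattern = "GCAGAGAG"
reversed_pattern = pattern[::-1]  # PR

# (start position, suffix) pairs, positions counted from 1 (assumed convention).
suffixes = [(i + 1, reversed_pattern[i:]) for i in range(len(reversed_pattern))]

for position, suffix in suffixes:
    # Reversing a suffix of PR gives back a prefix of P.
    print(position, suffix, "->", suffix[::-1])
```

The right-hand column lists exactly the prefixes of P, which is why a structure built on the suffixes of PR can recognise prefixes of P when the window is read from right to left.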
G C A T C G C A G A G A G T A T A C A G T A C G
G C A G A G A G
When there is a match, how do we move the window?
[Figure: suffix tree of PR with edge labels, shown with T and P]
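One way to answer this question is to look for the longest proper prefix of P that is also a suffix of P (a border), since after a match the window contains P itself. The sketch below is a brute-force illustration; the name `longest_border` is hypothetical.

```python
def longest_border(pattern: str) -> int:
    """Length of the longest proper prefix of `pattern` that is also a suffix."""
    for k in range(len(pattern) - 1, 0, -1):
        if pattern[:k] == pattern[-k:]:
            return k
    return 0


pattern = "GCAGAGAG"
border = longest_border(pattern)
print(border)                 # 1, the border "G"
print(len(pattern) - border)  # 7, the shift after a match
```

For this pattern the result agrees with the shift of 8 - 1 = 7 used after the match in the worked example.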
G C A T C G C A G G C A G T A T A C A G T A C G
G C A G A G A G
Find the longest suffix of W that is also a prefix of the pattern.
[Figure: suffix tree of PR, shown with T and P]
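The right-to-left scan of a window can be sketched with an uncompressed suffix trie of PR standing in for the suffix tree of the figure. The names `build_suffix_trie` and `scan_window` are hypothetical, and the choice of the window starting at position 5 of this text is an assumption, since the alignment in the figure is not available.

```python
END = None  # marker key: the path from the root to this node spells a suffix


def build_suffix_trie(s):
    """Uncompressed trie of all suffixes of s, as nested dictionaries."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def scan_window(window, trie):
    """Read the window right to left; return the length of its longest
    suffix that is a prefix of the pattern whose reversal built the trie."""
    node, best = trie, 0
    for read, ch in enumerate(reversed(window), start=1):
        if ch not in node:
            break  # the part read so far is no longer a factor
        node = node[ch]
        if END in node:
            best = read  # the part read so far is a prefix of the pattern
    return best


pattern = "GCAGAGAG"
text = "GCATCGCAGGCAGTATACAGTACG"
trie = build_suffix_trie(pattern[::-1])
window = text[5:5 + len(pattern)]  # assumed window: "GCAGGCAG"
print(window, scan_window(window, trie))  # GCAGGCAG 4, the suffix "GCAG"
```

The scan stops at the fifth character from the right, because "GGCAG" is not a factor of the pattern.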
A Whole Example
• T=GCATCGCAGAGAGTATACAGTACG
• P=GCAGAGAG
• First attempt:
G C A T C G C A G A G A G T A T A C A G T A C G
G C A G A G A G
Shift by: 5 (8 - 3)
[Figure: suffix tree of PR, shown with T and P]
• Second attempt:
G C A T C G C A G A G A G T A T A C A G T A C G
G C A G A G A G
Shift by: 7 (8 - 1)
[Figure: suffix tree of PR with edge labels, shown with T and P]
• Third attempt:
G C A T C G C A G A G A G T A T A C A G T A C G
G C A G A G A G
Shift by: 7 (8 - 1)
[Figure: suffix tree of PR with edge labels, shown with T and P]
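The three attempts can be replayed with a compact sketch of the whole search. To keep it short, a plain set of all factors of PR stands in for the suffix tree or suffix automaton; that substitution, and the name `reverse_factor_search`, are assumptions made for illustration and cost more preprocessing than the real structures.

```python
def reverse_factor_search(text: str, pattern: str):
    """Return (match positions, shifts performed), positions counted from 0."""
    m, n = len(pattern), len(text)
    rev = pattern[::-1]
    # All non-empty factors of the reversed pattern (naive stand-in structure).
    factors = {rev[i:j] for i in range(m) for j in range(i + 1, m + 1)}
    matches, shifts = [], []
    pos = 0
    while pos + m <= n:
        read = ""  # window characters read so far, right to left
        best = 0   # longest proper prefix of the pattern seen as a window suffix
        for k in range(1, m + 1):
            candidate = read + text[pos + m - k]
            if candidate not in factors:
                break
            read = candidate
            if k < m and rev.endswith(read):
                best = k
        if len(read) == m:
            matches.append(pos)  # the whole window was read: it equals the pattern
        shifts.append(m - best)
        pos += m - best
    return matches, shifts


text = "GCATCGCAGAGAGTATACAGTACG"
pattern = "GCAGAGAG"
print(reverse_factor_search(text, pattern))  # ([5], [5, 7, 7])
```

The shifts 5, 7 and 7 are those of the first, second and third attempts, and the single match is found in the second attempt, at position 5 counting from 0.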