Linear Time Algorithms for Exact Matching
Book: Algorithms on strings, trees and sequences by Dan Gusfield
Presented by: Amir Anter and Vladimir Zoubritsky
Given a string P called pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.
Exact Matching Problem
P = aa and T = abaabaaa. P occurs in T 3 times, starting at locations 3, 6 and 7.
◦ Location 3: ab[aa]baaa
◦ Location 6: abaab[aa]a
◦ Location 7: abaaba[aa]
Note that occurrences may overlap, as at locations 6 and 7.
Exact Matching Problem - Example
Grep command in Unix:
◦ grep apple fruitlist.txt
Internet browsers – Find option.
Biology - Searching for a string in a DNA database.
Articles, online books.
Use cases and motivation
Google books – example
1. Align the left end of P with the left end of T.
2. Compare the characters of P and T left to right until:
2.1 A mismatch is found, or
2.2 P ends, in which case an occurrence of P is reported.
3. Shift P one place to the right.
4. If P’s right end is farther than T’s right end: finish.
5. Else go to 2.
Naive Algorithm
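The steps above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `naive_match` is hypothetical, and returning an empty list for an empty pattern is an assumption made here.

```python
def naive_match(pattern: str, text: str) -> list[int]:
    """Return the 1-based starting locations of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []  # assumption: an empty pattern reports no occurrences
    locations = []
    # Steps 1, 3 and 4: try every alignment whose right end stays inside text.
    for start in range(m - n + 1):
        matched = 0
        # Step 2: compare left to right until a mismatch or until pattern ends.
        while matched < n and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == n:
            locations.append(start + 1)  # the slides count positions from 1
    return locations


print(naive_match("aa", "abaabaaa"))  # [3, 6, 7]
```

The outer loop plays the role of shifting P one place to the right; the loop bound `m - n + 1` encodes the stopping test of step 4.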
Step 1: P is aligned at position 1 of T.
  abaabaaa
  aa
Step 1.1: T[1]=a and P[1]=a match.
Step 1.2: T[2]=b and P[2]=a mismatch. Shift P.
Step 2: P is aligned at position 2 of T.
  abaabaaa
   aa
Step 2.1: T[2]=b and P[1]=a mismatch. Shift P.
Step 3: P is aligned at position 3 of T.
  abaabaaa
    aa
Step 3.1: T[3]=a and P[1]=a match.
Step 3.2: T[4]=a and P[2]=a match.
Report match at location 3
Step 4: P is aligned at position 4 of T.
  abaabaaa
     aa
Step 4.1: T[4]=a and P[1]=a match.
Step 4.2: T[5]=b and P[2]=a mismatch. Shift P.
Step 5: P is aligned at position 5 of T.
  abaabaaa
      aa
Step 5.1: T[5]=b and P[1]=a mismatch. Shift P.
Step 6: P is aligned at position 6 of T.
  abaabaaa
       aa
Step 6.1: T[6]=a and P[1]=a match.
Step 6.2: T[7]=a and P[2]=a match.
Report match at location 6
Step 7: P is aligned at position 7 of T.
  abaabaaa
        aa
Step 7.1: T[7]=a and P[1]=a match.
Step 7.2: T[8]=a and P[2]=a match.
Report match at location 7
Step 8: P’s right end is farther than T’s right end.
End
Example: T=abaabaaa P=aa
Let P’s length be n. Let T’s length be m.
The number of character comparisons in the worst case is O(nm).
No additional storage is needed.
A 30-character string search in GenBank (DNA DB) took more than 4 hours.
We will show a linear-time algorithm, which improves this time to 10 minutes.
Naive Algorithm - Complexity
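To make the O(nm) bound concrete, the sketch below counts the character comparisons made by the naive algorithm. The helper `count_comparisons` is a hypothetical name introduced for this illustration. With P = a^n and T = a^m every alignment costs n comparisons, giving n(m-n+1) in total.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count character comparisons made by the naive left-to-right scan."""
    n, m = len(pattern), len(text)
    comparisons = 0
    for start in range(m - n + 1):
        for offset in range(n):
            comparisons += 1
            if text[start + offset] != pattern[offset]:
                break  # a mismatch ends this alignment
    return comparisons


n, m = 3, 8
print(count_comparisons("a" * n, "a" * m))   # 18, i.e. n * (m - n + 1)
print(count_comparisons("aa", "abaabaaa"))   # 12
```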
Given a string S and a position i > 1, let Z_i(S) be the length of the longest substring of S that starts at i and matches a prefix of S.
Equivalently: Z_i(S) is the length of the longest prefix of S[i..|S|] that matches a prefix of S.
Z function
Z_5(S) = 3 (aabc...aabx)
Z_6(S) = 1 (aa...ab)
Z_7(S) = 0
Z_8(S) = 0
Z_9(S) = 2 (aab...aaz)
Example: S=aabcaabxaaz
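The definition can be checked directly with a small Python sketch. It computes each Z value by brute force, so it takes quadratic time in the worst case; it only illustrates the definition and is not the linear-time method. The name `z_by_definition` is hypothetical.

```python
def z_by_definition(s: str) -> dict[int, int]:
    """Map each 1-based position i > 1 to Z_i(s), straight from the definition."""
    z = {}
    for i in range(2, len(s) + 1):
        length = 0
        # Extend while the substring starting at i keeps matching the prefix of s.
        while i - 1 + length < len(s) and s[i - 1 + length] == s[length]:
            length += 1
        z[i] = length
    return z


z = z_by_definition("aabcaabxaaz")
print([z[i] for i in (5, 6, 7, 8, 9)])  # [3, 1, 0, 0, 2]
```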
P – pattern of length n.
T – text of length m.
Let S = P$T, where $ does not appear in P or in T. S’s length is n+m+1 = O(m).
Let’s assume we have computed Z_i(S) for 2 ≤ i ≤ n+m+1 at a preprocessing stage.
Claim: Any value of i > n+1 such that Z_i(S) = n identifies an occurrence of P in T starting at position i-(n+1) of T.
Claim: If P occurs in T starting at position j of T, then Z_{(n+1)+j}(S) = n.
Do we really need $? (Except for USD)
Using Z function to solve the exact matching problem
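The two claims translate into a short matcher. The sketch below is an illustration under stated assumptions: `z_values` is a compact 0-based formulation of the Z computation (a common way to write it, not necessarily the slides' presentation), and `find_occurrences` and its `sep` parameter are hypothetical names. Index arithmetic: 0-based index i of S is 1-based position i+1, so an occurrence at position j of T appears at index n+j.

```python
def z_values(s: str) -> list[int]:
    """0-based Z array: z[i] is the longest prefix of s[i:] that is a prefix of s."""
    size = len(s)
    z = [0] * size
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, size):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < size and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def find_occurrences(pattern: str, text: str, sep: str = "$") -> list[int]:
    """Return 1-based positions of pattern in text using Z values of P$T."""
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    n = len(pattern)
    if n == 0:
        return []  # assumption: an empty pattern reports no occurrences
    z = z_values(pattern + sep + text)
    # Z equal to n at index i (i > n) means an occurrence at position i - n of T.
    return [i - n for i in range(n + 1, len(z)) if z[i] == n]


print(find_occurrences("aa", "abaabaaa"))  # [3, 6, 7]
```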
For any position i > 1 where Z_i > 0, the Z-box at i is defined as the interval starting at i and ending at i + Z_i - 1.
For example, Z_5 = 3 below, so the Z-box at 5 is the interval from 5 to 7.
1 2 3 4 5 6 7 8 9 10 11
a a b c a a b x a a  z
Z box
r_i - The right-most end of any Z-box that begins at or before position i.
S[l_i..r_i] - A substring that is some Z-box ending at r_i.
l_i - The left end of S[l_i..r_i].
Z box
a a b c a a b x a a  z
1 2 3 4 5 6 7 8 9 10 11
r_5 = 7, l_5 = 5
r_6 = 7, l_6 = 5
Z box
Our task is to compute Z values in linear time.
Let’s find Z_2 by comparing, left to right, the characters of S[2..|S|] and S[1..|S|] until a mismatch is found. Z_2 is the length of the matching string.
If Z_2 = 0: r = r_2 = 0 and l = l_2 = 0.
If Z_2 > 0: r = r_2 = Z_2 + 1 and l = l_2 = 2.
The Z algorithm
Let’s assume we have all Z values up to k-1. The idea is to use the already computed Z values to compute Z_k.
The Z algorithm
Example: Z_2 ... Z_120 are known, k = 121, r = r_120 = 130, l = l_120 = 100.
The Z-box S[100..130] has length 31 and matches the prefix S[1..31].
Position k = 121 of the Z-box corresponds to position k' = 121 - 100 + 1 = 22 of the prefix.
Z_22 = 3 ⇒ Z_121 = 3
Why does Z_22 = 3 imply Z_121 = 3? S[121..130] = S[22..31], and Z_22 = 3 means S[22..24] = S[1..3] while S[25] ≠ S[4]; so S[121..123] = S[1..3] and Z_121 ≥ 3. Let’s assume Z_121 ≥ 4. Then S[124] = S[4], and since S[124] = S[25] this gives S[25] = S[4], contradicting Z_22 = 3.
The Z algorithm
Given Z_i for all 1 < i ≤ k-1 and the current values of r and l, compute Z_k and update r and l:
1. If k > r, then find Z_k by comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found. Set Z_k to be the length of the match. If Z_k > 0: r = k + Z_k - 1, l = k.
2. If k ≤ r, position k is contained in a Z-box. Let k' = k - l + 1 and β = |S[k..r]| = r - k + 1.
2.a If Z_{k'} < β: Z_k = Z_{k'}; r and l remain unchanged.
2.b If Z_{k'} ≥ β, then Z_k ≥ β. Compare the characters starting at position r+1 of S to the characters starting at position β+1 of S until a mismatch. Say the mismatch occurs at character q ≥ r+1. Then Z_k = q - k, r = q - 1, l = k.
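The case analysis above can be sketched in Python with 1-based positions, so that k, k', β, r, l and q carry the same meaning as in the steps. This is an illustrative sketch, not the slides' own listing; the names `z_algorithm` and `match_length` are hypothetical, and the case-1 update assumes the Z-box at k ends at k + Z_k - 1, in line with the Z-box definition.

```python
def z_algorithm(s: str) -> list[int]:
    """Return z with z[i] = Z_i(s) for 2 <= i <= len(s) (1-based positions).

    z[0] and z[1] are unused placeholders so that indices match the text.
    """
    size = len(s)
    z = [0] * (size + 1)
    ch = " " + s  # padding so that ch[i] is the character at 1-based position i

    def match_length(a: int, b: int) -> int:
        """Count matching characters when comparing from positions a < b."""
        length = 0
        while b + length <= size and ch[a + length] == ch[b + length]:
            length += 1
        return length

    r = l = 0
    for k in range(2, size + 1):
        if k > r:
            # Case 1: no Z-box covers k, so compare explicitly with the prefix.
            z[k] = match_length(1, k)
            if z[k] > 0:
                r, l = k + z[k] - 1, k
        else:
            k_prime = k - l + 1
            beta = r - k + 1  # beta = |S[k..r]|
            if z[k_prime] < beta:
                # Case 2.a: the value is copied; r and l stay as they are.
                z[k] = z[k_prime]
            else:
                # Case 2.b: S[k..r] matches S[1..beta]; keep comparing past r.
                q = r + 1 + match_length(beta + 1, r + 1)  # first mismatch
                z[k] = q - k
                r, l = q - 1, k
    return z


print(z_algorithm("aabcaabxaaz")[2:])  # [1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

If the string ends before a mismatch is seen, q is taken to be one position past the end of S, which keeps the formulas Z_k = q - k and r = q - 1 valid.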
Example - JavaScript
The Z algorithm
The Z algorithm - Correctness
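Case 1: k > r
Example: k = 121, r = 118. No Z-box found so far contains position k, so Z_k is found by explicit comparisons and is correct by definition.
The Z algorithm - Correctness
Case 2.a: Z_{k'} < β
Example: k = 121, r = 130, l = 100, k' = 22, β = |S[k..r]| = 10. S[k..r] = S[k'..Z_l], and the match starting at k' ends before the end of that substring, so the match starting at k ends at the same offset: Z_k = Z_{k'}.
The Z algorithm - Correctness
Case 2.b: Z_{k'} ≥ β
Example: k = 121, r = 130, l = 100, k' = 22, β = |S[k..r]| = 10. All of S[k..r] matches a prefix of S, so Z_k ≥ β. The characters beyond r have not been examined yet, so the comparison continues from position r+1.
The Z algorithm - Correctness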
Do we really need $?
By definition, Z_i(S) is the length of the longest prefix of S[i..|S|] that matches a prefix of S.
If P’s length is n, Z_i(S) ≥ n for a position i > n indicates an occurrence of P in T, in case S=P$T and also in case S=PT.
The answer is no, in terms of correctness.
The Z algorithm – Correctness
So why use $?
Using $ ensures a limit of n for the values of Z_i.
In the algorithm we use some Z_{k'} and S[k'..Z_l] to compute the current Z_k.
Z_i ≤ n ⇒ Z_l ≤ n
k' ≤ Z_l ⇒ k' ≤ n
So only Z values of positions inside P are ever looked up: we need only O(|P|) additional space. O(|T|) is not bearable.
The Z algorithm – Space complexity
|S| iterations.
Number of comparisons:
◦ Each mismatch ends an iteration. Max total of |S| mismatches for the entire algorithm.
◦ Each match increments the value of r by at least 1.
◦ r ≤ |S| ⇒ number of matches ≤ |S| for the entire algorithm.
The Z algorithm – Time complexity
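The counting argument can be checked empirically. The sketch below instruments a compact 0-based version of the Z computation (an assumption of this illustration, with the hypothetical name `z_with_comparison_count`) and counts only character comparisons; by the argument above the count should never exceed 2|S|.

```python
def z_with_comparison_count(s: str) -> tuple[list[int], int]:
    """Return the 0-based Z array of s and the number of character comparisons."""
    size = len(s)
    z = [0] * size
    comparisons = 0
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, size):
        if i < right and z[i - left] < right - i:
            z[i] = z[i - left]  # value copied, no character comparisons
            continue
        z[i] = max(0, right - i)  # start from the part already known to match
        while i + z[i] < size:
            comparisons += 1
            if s[z[i]] != s[i + z[i]]:
                break  # the single mismatch that ends this iteration
            z[i] += 1  # each match pushes the right end of the Z-box forward
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z, comparisons


for s in ("aabcaabxaaz", "a" * 20, "ab" * 10, "aa$abaabaaa"):
    _, count = z_with_comparison_count(s)
    print(len(s), count, count <= 2 * len(s))
```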
Character comparisons are made only in steps 1 and 2.b:
1. If k > r, then find Z_k by comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found. Set Z_k to be the length of the match. If Z_k > 0: r = k + Z_k - 1, l = k.
The Z algorithm – Time complexity
2. If k ≤ r, position k is contained in a Z-box. Let k' = k - l + 1 and β = |S[k..r]| = r - k + 1.
2.b If Z_{k'} ≥ β, then Z_k ≥ β. Compare the characters starting at position r+1 of S to the characters starting at position β+1 of S until a mismatch. Say the mismatch occurs at character q ≥ r+1. Then Z_k = q - k, r = q - 1, l = k.
The Z algorithm – Time complexity
Total running time: O(|S|) = O(n + m)
Knuth-Morris-Pratt (KMP)
Aho–Corasick string matching algorithm
◦ Is a generalization of KMP.
◦ Set of patterns in linear time.
Boyer-Moore
◦ Typically runs in sublinear time.
◦ It is used in practice for exact matching.
◦ Worst case linear.
Why continue?
Thank You!