
Linear Time Algorithms for Exact Matching

Book: Algorithms on strings, trees and sequences by Dan Gusfield

Presented by: Amir Anter and Vladimir Zoubritsky

Given a string P called the pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.

Exact Matching Problem

P=aa and T=abaabaaa. P occurs in T 3 times, starting at locations 3, 6 and 7 (each occurrence is shown in brackets).

◦ Location 3: ab[aa]baaa
◦ Location 6: abaab[aa]a
◦ Location 7: abaaba[aa]

Note that occurrences may overlap, as at locations 6 and 7.

Exact Matching Problem - Example

Grep command in Unix:
◦ grep apple fruitlist.txt

Internet browsers – Find option.

Biology - Searching for a string in a DNA database.

Articles, online books.

Usage cases and motivation

Google books – example

1. Align the left end of P with the left end of T.
2. Compare the characters of P and T left to right until:
2.1 A mismatch is found, or
2.2 P ends. An occurrence of P is reported.
3. Shift P one place to the right.
4. If P's right end is farther than T's right end: finish.
5. Else go to 2.

Naive Algorithm
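The steps of the naive algorithm can be sketched in JavaScript as follows. This is a minimal illustration, not code from the slides: the function name `naiveMatch` and the choice to return 1-based locations in an array are assumptions made here to mirror the deck's numbering.

```javascript
// Naive exact matching: returns the 1-based start locations of P in T.
// The name naiveMatch and the array return value are choices made for this sketch.
function naiveMatch(pattern, text) {
  const n = pattern.length;
  const m = text.length;
  const occurrences = [];
  // Steps 3-5: try every alignment whose right end stays inside T.
  for (let start = 0; start + n <= m; start++) {
    // Step 2: compare left to right until a mismatch or the end of P.
    let j = 0;
    while (j < n && text[start + j] === pattern[j]) {
      j++;
    }
    if (j === n) {
      occurrences.push(start + 1); // Step 2.2: report an occurrence.
    }
  }
  return occurrences;
}

console.log(naiveMatch("aa", "abaabaaa")); // locations 3, 6 and 7
```

Each alignment can cost up to |P| comparisons, which is where the O(nm) worst case comes from.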

Step 1: P is aligned at position 1 of T.
Step 1.1: T(1)=a matches P(1)=a.
Step 1.2: T(2)=b does not match P(2)=a. Shift P.

Step 2: P is aligned at position 2 of T.
Step 2.1: T(2)=b does not match P(1)=a. Shift P.

Step 3: P is aligned at position 3 of T.
Step 3.1: T(3)=a matches P(1)=a.
Step 3.2: T(4)=a matches P(2)=a.
Report match at location 3

Step 4: P is aligned at position 4 of T.
Step 4.1: T(4)=a matches P(1)=a.
Step 4.2: T(5)=b does not match P(2)=a. Shift P.

Step 5: P is aligned at position 5 of T.
Step 5.1: T(5)=b does not match P(1)=a. Shift P.

Step 6: P is aligned at position 6 of T.
Step 6.1: T(6)=a matches P(1)=a.
Step 6.2: T(7)=a matches P(2)=a.
Report match at location 6

Step 7: P is aligned at position 7 of T.
Step 7.1: T(7)=a matches P(1)=a.
Step 7.2: T(8)=a matches P(2)=a.
Report match at location 7

Step 8: P's right end is farther than T's right end.
End

Example: T=abaabaaa P=aa

Let P's length be n.
Let T's length be m.
The number of character comparisons in the worst case is O(nm).
No additional storage is needed.
A 30-character string search in GenBank (DNA DB) took more than 4 hours.
We will show a linear time algorithm, which improves this time to 10 minutes.

Naive Algorithm - Complexity

Given a string S and a position \(i > 1\), let \(Z_i(S)\) be the length of the longest substring of S that starts at i and matches a prefix of S.

Equivalently: \(Z_i(S)\) is the length of the longest prefix of S[i..|S|] that matches a prefix of S.

Z function

\(Z_5(S) = 3\): aabc[aab]xaaz
\(Z_6(S) = 1\): aabca[a]bxaaz
\(Z_7(S) = 0\)
\(Z_8(S) = 0\)
\(Z_9(S) = 2\): aabcaabx[aa]z

Example: S=aabcaabxaaz
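The definition can be checked directly with a short JavaScript sketch that computes every Z value by brute force. It is quadratic in the worst case and is only meant to make the definition concrete. The name `zByDefinition` and the 1-based result array (entries 0 and 1 unused) are assumptions of this sketch.

```javascript
// Z values straight from the definition (quadratic in the worst case).
// Returns an array z with z[i] = Z_i(S) for 1-based i >= 2; z[0] and z[1] stay 0.
function zByDefinition(s) {
  const len = s.length;
  const z = new Array(len + 1).fill(0);
  for (let i = 2; i <= len; i++) {
    let matched = 0;
    // s[i - 1 + matched] is the 1-based character S(i + matched).
    while (i + matched <= len && s[i - 1 + matched] === s[matched]) {
      matched++;
    }
    z[i] = matched;
  }
  return z;
}

const zExample = zByDefinition("aabcaabxaaz");
console.log(zExample[5], zExample[6], zExample[7], zExample[8], zExample[9]); // 3 1 0 0 2
```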

P – pattern of length n.
T – text of length m.

Let S = P$T, where $ does not appear in P or in T. S's length is \(n + m + 1 = O(m)\).

Let's assume we have computed \(Z_i(S)\) for \(2 \le i \le n + m + 1\) at a preprocessing stage.

Claim: Any value of \(i > n + 1\) such that \(Z_i(S) = n\) identifies an occurrence of P in T starting at position \(i - (n + 1)\) of T.

Claim: If P occurs in T starting at position j of T, then \(Z_{(n+1)+j}(S) = n\).

Do we really need $? (Except for USD)

Using Z function to solve the exact matching problem
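The two claims translate into a few lines of JavaScript. In this sketch the Z values come from a simple brute-force helper so that the block stays self-contained; a linear-time Z computation could be substituted without changing the matching logic. The names `zValues` and `matchWithZ`, and the `separator` parameter, are illustrative assumptions.

```javascript
// Helper for this sketch: Z values from the definition, z[i] = Z_i(S), 1-based.
function zValues(s) {
  const z = new Array(s.length + 1).fill(0);
  for (let i = 2; i <= s.length; i++) {
    while (i + z[i] <= s.length && s[i - 1 + z[i]] === s[z[i]]) {
      z[i]++;
    }
  }
  return z;
}

// Reports the 1-based positions of T where P occurs, using S = P$T.
// Assumes the separator occurs neither in P nor in T.
function matchWithZ(pattern, text, separator = "$") {
  const n = pattern.length;
  const s = pattern + separator + text;
  const z = zValues(s);
  const positions = [];
  for (let i = n + 2; i <= s.length; i++) {
    if (z[i] === n) {
      positions.push(i - (n + 1)); // position i of S is position i-(n+1) of T
    }
  }
  return positions;
}

console.log(matchWithZ("aa", "abaabaaa")); // locations 3, 6 and 7
```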

For any position \(i > 1\) where \(Z_i > 0\), the Z-box at i is defined as the interval starting at i and ending at \(i + Z_i - 1\).

Position:  1  2  3  4  5  6  7  8  9 10 11
S:         a  a  b  c  a  a  b  x  a  a  z

Z box

\(r_i\) - The right-most end of any Z-box that begins at or before position i.
\(S[l_i..r_i]\) - A substring that is some Z-box ending at \(r_i\).
\(l_i\) - The left end of that Z-box.

Position:  1  2  3  4  5  6  7  8  9 10 11
S:         a  a  b  c  a  a  b  x  a  a  z

\(r_5 = 7\), \(r_6 = 7\), \(l_5 = 5\), \(l_6 = 5\)
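As a small illustration, the sketch below derives \(r_i\) and \(l_i\) from a table of Z values. It assumes the convention that \(r_i\) ranges over Z-boxes beginning at or before position i, and the Z values for S=aabcaabxaaz are written out by hand from the definition. The function name `zBoxBounds` is made up for this example.

```javascript
// Given 1-based Z values (z[i] = Z_i for i >= 2), compute r_i and l_i for
// every i >= 2, taking Z-boxes that begin at or before position i.
function zBoxBounds(z) {
  const len = z.length - 1;
  const r = new Array(len + 1).fill(0);
  const l = new Array(len + 1).fill(0);
  let bestR = 0;
  let bestL = 0;
  for (let i = 2; i <= len; i++) {
    // A Z-box at i ends at i + Z_i - 1; keep the right-most end seen so far.
    if (z[i] > 0 && i + z[i] - 1 > bestR) {
      bestR = i + z[i] - 1;
      bestL = i;
    }
    r[i] = bestR;
    l[i] = bestL;
  }
  return { r, l };
}

// Z values of S = aabcaabxaaz, worked out by hand; indices 0 and 1 are unused.
const zOfExample = [0, 0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0];
const bounds = zBoxBounds(zOfExample);
console.log(bounds.r[5], bounds.l[5], bounds.r[6], bounds.l[6]); // 7 5 7 5
```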

Our task is to compute Z values in linear time.

Let's find \(Z_2\) by comparing, left to right, the characters of \(S[2..|S|]\) and \(S[1..|S|]\) until a mismatch is found. \(Z_2\) is the length of the matching string.

If \(Z_2 = 0\): \(r = r_2 = 0\), \(l = l_2 = 0\).
If \(Z_2 > 0\): \(r = r_2 = Z_2 + 1\), \(l = l_2 = 2\).

The Z algorithm

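Let's assume we have all Z values up to k-1. The idea is to use the already computed Z values to compute \(Z_k\).

Example: \(Z_2, \dots, Z_{120}\) are known, \(k = 121\), \(r = r_{120} = 130\), \(l = l_{120} = 100\).

◦ The Z-box \(S[100..130]\) has length 31 and matches the prefix \(S[1..31]\).
◦ Position \(k = 121\) lies inside this Z-box and corresponds to position \(k' = k - l + 1 = 22\) of the prefix, so \(S[121..130] = S[22..31]\).
◦ If \(Z_{22} = 3\), then \(Z_{121} = 3\).

Why does \(Z_{22} = 3\) imply \(Z_{121} = 3\)? \(S[121..123] = S[22..24] = S[1..3]\), so \(Z_{121} \ge 3\). Let x be the character \(S(25)\). Since \(Z_{22} = 3\), \(x \ne S(4)\), and \(S(124) = S(25) = x\) because position 124 lies inside the Z-box. Let's assume \(Z_{121} \ge 4\): then \(S(124) = S(4)\), which contradicts \(S(124) = x \ne S(4)\).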

The Z algorithm

Given \(Z_i\) for all \(1 < i \le k - 1\) and the current values of r and l, compute \(Z_k\) and update r and l:

1. If \(k > r\), then find \(Z_k\) by comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found. Set \(Z_k\) to be the length of the match. If \(Z_k > 0\): \(r = k + Z_k - 1\), \(l = k\).

2. If \(k \le r\): position k is contained in a Z-box. Let \(k' = k - l + 1\) and \(\beta = S[k..r]\).

2.a If \(Z_{k'} < |\beta|\): \(Z_k = Z_{k'}\); r and l remain unchanged.

2.b If \(Z_{k'} \ge |\beta|\): then \(Z_k \ge |\beta|\). Compare the characters starting at position \(r + 1\) of S to the characters starting at position \(|\beta| + 1\) of S until a mismatch. Say the mismatch occurs at character \(q \ge r + 1\). Then \(Z_k = q - k\), \(r = q - 1\), \(l = k\).

Example - JavaScript
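The original example for this slide did not survive extraction, so what follows is a hedged reconstruction rather than the presenters' code: a minimal JavaScript sketch of the case analysis (case 1, 2.a, 2.b), using 1-based positions to stay close to the slides. The names `zAlgorithm`, `at`, `kPrime` and `beta` are assumptions of this sketch.

```javascript
// Linear-time Z computation following cases 1, 2.a and 2.b.
// Returns z with z[i] = Z_i for 1-based i >= 2 (z[0] and z[1] are unused).
function zAlgorithm(s) {
  const len = s.length;
  const z = new Array(len + 1).fill(0);
  const at = (pos) => s[pos - 1]; // 1-based character access S(pos)
  let l = 0;
  let r = 0;
  for (let k = 2; k <= len; k++) {
    if (k > r) {
      // Case 1: explicit comparison from position k against the prefix.
      let matched = 0;
      while (k + matched <= len && at(k + matched) === at(1 + matched)) {
        matched++;
      }
      z[k] = matched;
      if (matched > 0) {
        r = k + matched - 1;
        l = k;
      }
    } else {
      const kPrime = k - l + 1;
      const beta = r - k + 1; // |beta|, the length of S[k..r]
      if (z[kPrime] < beta) {
        // Case 2.a: the value is copied; r and l are unchanged.
        z[k] = z[kPrime];
      } else {
        // Case 2.b: extend past r, comparing S(q) with S(q - k + 1).
        let q = r + 1;
        while (q <= len && at(q) === at(q - k + 1)) {
          q++;
        }
        z[k] = q - k;
        r = q - 1;
        l = k;
      }
    }
  }
  return z;
}

console.log(zAlgorithm("aabcaabxaaz").slice(2)); // Z_2 .. Z_11
```

In case 2.b the first comparison pairs position r+1 with position r-k+2, which is \(|\beta| + 1\), so no character at or before r is compared again.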

The Z algorithm

The Z algorithm - Correctness


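Case 1: \(k > r\). Example: \(k = 121\), \(r = 118\). Position k lies outside every Z-box found so far, so nothing is known about the characters from k onward and \(Z_k\) is found by direct comparison.

Case 2.a: \(Z_{k'} < |\beta|\). Example: \(k = 121\), \(r = 130\), \(l = 100\), \(k' = 22\). \(S[k..r]\) is a copy of \(S[k'..k' + |\beta| - 1]\), and the match starting at k' ends before the end of that copy, so the same match and the same mismatch occur at k: \(Z_k = Z_{k'}\).

Case 2.b: \(Z_{k'} \ge |\beta|\). Example: \(k = 121\), \(r = 130\), \(l = 100\), \(k' = 22\). The whole of \(S[k..r]\) matches a prefix of S, so \(Z_k \ge |\beta|\). The characters after r have not been examined yet, so the comparison continues from position \(r + 1\).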

Do we really need $? By definition, \(Z_i(S)\) is the length of the longest prefix of S[i..|S|] that matches a prefix of S.

If P's length is n, \(Z_i(S) \ge n\) at a position i inside T indicates an occurrence of P in T, in case S=P$T and also in case S=PT.

The answer is no, in terms of correctness.

The Z algorithm – Correctness
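To see that correctness does not depend on $, the following sketch matches with S=PT and the test \(Z_i(S) \ge n\) for positions inside T. The brute-force helper and the names `zValuesNoSep` and `matchWithoutSeparator` are assumptions made for this illustration.

```javascript
// Helper for this sketch: Z values from the definition, z[i] = Z_i(S), 1-based.
function zValuesNoSep(s) {
  const z = new Array(s.length + 1).fill(0);
  for (let i = 2; i <= s.length; i++) {
    while (i + z[i] <= s.length && s[i - 1 + z[i]] === s[z[i]]) {
      z[i]++;
    }
  }
  return z;
}

// Matching with S = PT (no separator): report positions i > n with Z_i >= n.
function matchWithoutSeparator(pattern, text) {
  const n = pattern.length;
  const s = pattern + text;
  const z = zValuesNoSep(s);
  const positions = [];
  for (let i = n + 1; i <= s.length; i++) {
    if (z[i] >= n) {
      positions.push(i - n); // position i of S is position i-n of T
    }
  }
  return positions;
}

console.log(matchWithoutSeparator("aa", "abaabaaa")); // locations 3, 6 and 7
```

For P=aa and T=abaabaaa, the Z value at the occurrence starting at location 6 of T is 3, which is larger than n=2. Without $, Z values are no longer capped at n.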

So why use $?

Using $ ensures a limit of n for the values of \(Z_i\). In the algorithm we use some \(Z_{k'}\) and \(S[k'..Z_l]\) to compute the current \(Z_k\):

\(Z_i \le n \Rightarrow Z_l \le n\)
\(k' \le Z_l \Rightarrow k' \le n\)

We need only \(O(|P|)\) additional space. \(O(|T|)\) additional space is not acceptable.

The Z algorithm – Space complexity

\(|S|\) iterations.

Number of comparisons:
◦ Each mismatch ends an iteration. Max total of \(|S|\) mismatches for the entire algorithm.
◦ Each match increments the value of r by at least 1.
◦ \(r \le |S|\), so there are at most \(|S|\) matching comparisons for the entire algorithm.

Characters are compared only in case 1 (\(k > r\)) and in case 2.b (\(Z_{k'} \ge |\beta|\), starting at position \(r + 1\)). Case 2.a sets \(Z_k\) without comparing any characters.

Total: \(O(|S|) = O(n + m)\)

The Z algorithm – Time complexity
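The counting argument can be observed empirically. The sketch below is the same case analysis with a counter added. It assumes the 1-based conventions used above, and `countZComparisons` is a name made up for this example. Since there are at most |S| matches and at most |S| mismatches, the counter should never exceed 2|S|.

```javascript
// Z algorithm instrumented to count character comparisons.
function countZComparisons(s) {
  const len = s.length;
  const z = new Array(len + 1).fill(0);
  let comparisons = 0;
  // Compares S(a) with S(b) (1-based) and counts the comparison.
  const same = (a, b) => {
    comparisons++;
    return s[a - 1] === s[b - 1];
  };
  let l = 0;
  let r = 0;
  for (let k = 2; k <= len; k++) {
    if (k > r) {
      // Case 1: compare from position k against the prefix.
      let matched = 0;
      while (k + matched <= len && same(k + matched, 1 + matched)) {
        matched++;
      }
      z[k] = matched;
      if (matched > 0) {
        r = k + matched - 1;
        l = k;
      }
    } else {
      const kPrime = k - l + 1;
      const beta = r - k + 1;
      if (z[kPrime] < beta) {
        z[k] = z[kPrime]; // Case 2.a: no comparisons at all.
      } else {
        // Case 2.b: only positions to the right of r are compared.
        let q = r + 1;
        while (q <= len && same(q, q - k + 1)) {
          q++;
        }
        z[k] = q - k;
        r = q - 1;
        l = k;
      }
    }
  }
  return { z, comparisons };
}

for (const s of ["aabcaabxaaz", "aaaaaaaaaa", "abababababab"]) {
  const { comparisons } = countZComparisons(s);
  console.log(s.length, comparisons); // comparisons stays at or below 2 * s.length
}
```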

Knuth-Morris-Pratt (KMP)

Aho–Corasick string matching algorithm
◦ Is a generalization of KMP.
◦ Matches a set of patterns in linear time.

Boyer-Moore
◦ Typically runs in sublinear time.
◦ It is used in practice for exact matching.
◦ Worst case linear.

Why continue?

Thank You!
