
Page 1: Łódź, 2008

Łódź, 2008

Intelligent Text Processing, lecture 2

Multiple and approximate string matching. Full-text indexing: suffix tree, suffix array

Szymon Grabowski, [email protected]

http://szgrabowski.kis.p.lodz.pl/IPT08/

Page 2: Łódź, 2008

Multiple string matching: problem statement and motivation

Sometimes we have a set of patterns P1, ..., Pr, and the task is to find all the occurrences of any Pi (i = 1..r) in T.

Trivial approach: run an exact string matching algorithm r times. Way too slow, even if r is moderate.

(Selected) applications:

• batched query handling in a text collection,

• looking for a few spelling variants of a word / phrase (e.g., P1 = “color” and P2 = “colour”),

• anti-virus software (search for virus signatures).

Page 3: Łódź, 2008

Adapting the Boyer–Moore approach to multiple string matching

BMH used a skip table d to perform the longest safe pattern shift guided by a single char only.

Having r patterns, we can perform skips, too. But they'll typically be shorter.

Example: P1 = bbcac, P2 = abbcc, T = abadbcac...

The 5th char of T is b, so we shift all the patterns by 2 chars (2 = min(2, 3)).
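A minimal sketch of how such a combined skip table could be computed (not from the slides; the name multi_bmh_skip is made up here, and all patterns are clipped to the length of the shortest one – the usual trick in BMH-style multi-pattern matching):

def multi_bmh_skip(patterns, alphabet):
    # Combined bad-character table: for each char, the smallest safe
    # shift over all patterns (BMH-style, the last window position excluded).
    m = min(len(p) for p in patterns)   # clip every pattern to the shortest length
    d = {c: m for c in alphabet}
    for p in patterns:
        for i in range(m - 1):          # window positions 0 .. m-2
            d[p[i]] = min(d[p[i]], m - 1 - i)
    return d

d = multi_bmh_skip(["bbcac", "abbcc"], "abcd")
print(d["b"])   # 2 = min(3 from bbcac, 2 from abbcc), as in the example above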

Page 4: Łódź, 2008


Adapting the Boyer–Moore approach to multiple string matching, example

Let’s continue this example.

Verifications are needed. How? If we compare the text area with all the patterns one by one, this will be too slow if the number of patterns is tens or more.

We can do it better...

E.g. with a trie.

Page 5: Łódź, 2008


Trie (aka digital tree) (Fredkin, 1960)

Etymology: reTRIEval (pronounce like try, to distinguish from tree)

http://www.cs.cityu.edu.hk/~deng/5286/L5.ppt

A trie housing the keys: an, ant, all, allot, alloy, aloe, are, ate, be
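For illustration (not from the slides), a dict-of-dicts trie over these keys can be built in a few lines of Python; "$" is an arbitrary end-of-key marker:

def trie_insert(root, key):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})   # follow the edge for ch, creating it if absent
    node["$"] = True                     # mark that a key ends here

def trie_contains(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

root = {}
for w in ["an", "ant", "all", "allot", "alloy", "aloe", "are", "ate", "be"]:
    trie_insert(root, w)
print(trie_contains(root, "allot"), trie_contains(root, "al"))   # True False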

Page 6: Łódź, 2008


Trie design dilemma

Natural tradeoff between search timeand space occupancy.

If only pointers from the “existing” chars in a node are kept, the trie is more space-efficient, but the time spent in a node is O(log σ) (binary search within the node). Note: binary search is good in theory (for the worst case), but usually bad in practice (apart from top trie levels / large alphabets?).

The time per node can be improved to O(1) (a single lookup) if each node takes O(σ) space.

In total, pattern search takes either O(m log σ) or O(m) worst-case time.

Page 7: Łódź, 2008


Let’s trie to do it better...

In most cases tries require a lot of space.

A widely-used improvement: path compression, i.e., combining every non-branching node with its child = Patricia trie (Morrison, 1968).

Other ideas: smartly using only one bit per pointer, or one pointer for all the children of a node.

PATRICIA stands for Practical Algorithm To Retrieve Information Coded in Alphanumeric.

Page 8: Łódź, 2008


Rabin–Karp algorithm combined with binary search

(Kytöjoki et al., 2003)

From the cited paper:

Preprocessing: hash values for all patterns are calculated and stored in an ordered table.

Matching can then be done by calculating the hash value for each m-char string of the text and searching the

ordered table for this hash value using binary search. If a matching hash value is found, the corresponding

pattern is compared with the text.

Page 9: Łódź, 2008


Rabin–Karp alg combined with binary search, cont'd (Kytöjoki et al., 2003)

Kytöjoki et al. implemented this method for m = 8, 16, and 32.

The hash values for patterns of m = 8: a 32-bit int is formed of the first 4 bytes of the pattern, and another from the last 4 bytes. These are then XORed together, resulting in the following hash function:

Hash(x1 ... x8) = x1x2x3x4 ^ x5x6x7x8

The hash values for m = 16:

Hash16(x1 ... x16) = x1x2x3x4 ^ x5x6x7x8 ^ x9x10x11x12 ^ x13x14x15x16

Hash32 analogously.
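A rough Python rendering of the scheme for m = 8 (a sketch under assumptions, not the authors' code: hash8 and search8 are hypothetical names, the byte order is an arbitrary choice, and the hash is recomputed from scratch at each text position, which for this XOR scheme costs only two word loads anyway):

import struct
from bisect import bisect_left

def hash8(s):
    # XOR of the first and last 4 bytes of an 8-byte string, read as 32-bit ints
    a, = struct.unpack(">I", s[:4])
    b, = struct.unpack(">I", s[4:8])
    return a ^ b

def search8(patterns, text):
    table = sorted((hash8(p), p) for p in patterns)   # ordered table of pattern hashes
    hashes = [h for h, _ in table]
    for j in range(len(text) - 7):
        h = hash8(text[j:j+8])
        i = bisect_left(hashes, h)                    # binary search in the table
        while i < len(table) and hashes[i] == h:      # verify every colliding pattern
            if text[j:j+8] == table[i][1]:
                print("match at", j)
            i += 1

search8([b"abcdefgh", b"hgfedcba"], b"xxabcdefghyy")  # match at 2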

Page 10: Łódź, 2008


Approximate string matching

Exact string matching problems are quite simple and almost closed in theory (new algorithms appear, but most of them are useful heuristics rather than new theoretical achievements).

Approximate matching, on the other hand, is still a very active research area.

Many practical notions of “approximateness” have been proposed, e.g., for tolerating typos in text, false notes in music scores, variations (mutations) of DNA sequences, music melodies transposed to another key, etc.

Page 11: Łódź, 2008


Edit distance (aka Levenshtein distance)

One of the most frequently used measures in string matching.

edit(A, B) is the min number of elementary operationsneeded to convert A into B (or vice versa).

The allowed basic operations are:

• insert a single char,

• delete a single char,

• substitute a char.

Example: edit(pile, spine) = 2 (insert s; replace l with n).

Page 12: Łódź, 2008


Edit distance recurrence

We want to compute ed(A, B). The dynamic programming algorithm fills the matrix C0..|A|, 0..|B|, where Ci,j holds the min number of operations to convert A1..i into B1..j.

The formulas are:

Ci,0 = i

C0,j = j

Ci,j = if (Ai = Bj) then Ci–1,j–1

else 1 + min(Ci–1,j, Ci,j–1, Ci–1,j–1)
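The recurrence translates directly into Python (a plain sketch of the classic DP, not code from the slides):

def edit(A, B):
    # C[i][j] = min number of operations converting A[1..i] into B[1..j]
    C = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(len(A) + 1):
        C[i][0] = i
    for j in range(len(B) + 1):
        C[0][j] = j
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            if A[i-1] == B[j-1]:
                C[i][j] = C[i-1][j-1]
            else:
                C[i][j] = 1 + min(C[i-1][j], C[i][j-1], C[i-1][j-1])
    return C[len(A)][len(B)]

print(edit("pile", "spine"))      # 2, as in the earlier example
print(edit("surgery", "survey"))  # 2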

Page 13: Łódź, 2008

DP for edit distance, example

A = surgery, B = survey (a widely used example, e.g. from Gonzalo Navarro's PhD, 1998, ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz)

[Figure: the filled DP matrix C for A = surgery, B = survey.]

Page 14: Łódź, 2008


Local similarity

Global measure: ed(A, B); or the search problem variant: ed(T[j'..j], P[1..m]).

How to adapt the DP alg to search for a (short)pattern P in a (long) text T?

Very simply. Each position in T may start a match, so we set C0,j = 0 for all j.

Then we go column-wise (we calculate the columns C[j] one by one, for j = 1..n).
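A sketch of this column-wise search (approx_search is a hypothetical helper; only the previous column plus one diagonal value are kept, since column j depends only on column j–1):

def approx_search(P, T, k):
    m = len(P)
    C = list(range(m + 1))        # column 0: C[i][0] = i
    for j in range(1, len(T) + 1):
        diag = C[0]               # C[i-1][j-1], starting from row 0
        C[0] = 0                  # a match may start at any position of T
        for i in range(1, m + 1):
            old = C[i]            # C[i][j-1]
            if P[i-1] == T[j-1]:
                C[i] = diag
            else:
                C[i] = 1 + min(old, C[i-1], diag)
            diag = old
        if C[m] <= k:             # P matches, with <= k errors, ending at j
            print("occurrence ending at", j)

approx_search("survey", "surgery", 2)   # occurrences ending at 5, 6, 7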

Page 15: Łódź, 2008

DP approach

But the complexity is O(mn), even in the best case. So algorithms have been devised that are not always better in the worst case, but are better on average.

Very flexible: e.g., you can associate positive weights (penalty costs) with each of the elementary error types (i.e., insertion, deletion, substitution), and then such a generalized edit distance calculation problem is solved after a trivial modification of the basic algorithm.

Formula for this case (for the search problem variant), at text position j, with weights wins, wdel, wsubst:

Ci,j = min( Ci–1,j–1 + (0 if Pi = Tj else wsubst), Ci–1,j + wdel, Ci,j–1 + wins )

Page 16: Łódź, 2008


Partitioning lemma for the edit distance

We look for approximate occurrences of a pattern, with max allowed error k.

Lemma (Rivest, 1976; Wu & Manber, 1992): if the pattern is split into k+1 (disjoint) pieces, then at least one piece must appear unaltered in an approximate occurrence of the pattern.

More generally, if P is split into k+l parts, then at least l pieces must appear unaltered.

Page 17: Łódź, 2008


Partitioning lemma is a special case of the Dirichlet principle

The Dirichlet principle (aka pigeonhole principle) is a very obvious (but useful in math) general observation.

Roughly, it says that if a pigeon is not going to occupy a pigeonhole which already contains a pigeon, there is no way to fit n pigeons into fewer than n pigeonholes.

Others prefer an example with rabbits. If you have 10 rabbits and 9 cages, (at least) one cage must have (at least) two rabbits. Or (more appropriate for our partitioning lemma): 9 rabbits and 10 cages → (at least) one cage must be empty.

Page 18: Łódź, 2008


Dirichlet principle (if you want to be serious)

For any natural number n, there does not exist a bijection between

a set S such that |S|=n and a proper subset of S.

Page 19: Łódź, 2008


Partitioning lemma in practice

Approx. string matching with max error k (edit distance):

• divide the pattern P into k+1 disjoint parts of length ⌊m/(k+1)⌋,

• run any multiple exact string matching alg for those k+1 subpatterns,

• verify all matches (need a tool for approximate matching anyway... could be dynamic programming); see the sketch below.
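A toy sketch of the filtering step (hypothetical code: str.find stands in for a proper multiple string matching algorithm, and the reported text windows still have to be verified, e.g. with the DP of the previous slides):

def candidate_windows(P, T, k):
    m = len(P)
    step = m // (k + 1)
    # k+1 disjoint pieces; the last piece absorbs the remainder
    pieces = [P[i*step:(i+1)*step] for i in range(k)] + [P[k*step:]]
    for i, piece in enumerate(pieces):
        j = T.find(piece)
        while j != -1:
            # an occurrence containing piece i can only span this window
            yield max(0, j - i*step - k), min(len(T), j - i*step + m + k)
            j = T.find(piece, j + 1)

print(sorted(set(candidate_windows("survey", "surgery and surveys", 1))))
# [(0, 7), (11, 19)] – verification then keeps only the "surveys" window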

Page 20: Łódź, 2008


Indel distance

Very similar to edit distance, but only INsertions and DELetions are allowed.

Trivially, indel(A, B) ≥ edit(A, B).

Both the edit() and indel() distance functions are metrics.

That is, they satisfy the four conditions: non-negativity, identity of indiscernibles, symmetry, and the triangle inequality (d(A, B) ≤ d(A, C) + d(C, B)).

Page 21: Łódź, 2008


Hamming distance

Very simple (but with limited applications).

By analogy to the binary alphabet, dH(S1, S2) is the number of positions at which S1 and S2 differ.

If |S1| ≠ |S2|, then dH(S1, S2) = ∞.

Example

S1 = Donald Duck
S2 = Donald Tusk

dH(S1, S2) = 2.
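In Python this is a one-liner (a trivial sketch, not from the slides):

def hamming(S1, S2):
    if len(S1) != len(S2):
        return float("inf")                      # undefined for unequal lengths
    return sum(a != b for a, b in zip(S1, S2))   # count the differing positions

print(hamming("Donald Duck", "Donald Tusk"))     # 2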

Page 22: Łódź, 2008


Longest Common Subsequence (LCS)

Given strings A, B, |A| = n, |B| = m, find the longest subsequence shared by both strings.

More precisely, find 1 ≤ i1 < i2 < ... < ik–1 < ik ≤ n, and 1 ≤ j1 < j2 < ... < jk–1 < jk ≤ m, such that A[i1] = B[j1], A[i2] = B[j2], ..., A[ik] = B[jk]

and k is maximized.

k is the length of the LCS(A, B), also denoted as LLCS(A, B).

Sometimes we are interested in a simpler problem:finding only the LLCS, not the matching sequence.

Page 23: Łódź, 2008

LCS applications

• diff utility (e.g., comparing two different versions of a file, or two versions of a large programming project),

• molecular biology (biologists find a new sequence; which other sequences is it most similar to?),

• finding the longest ascending subsequence of a permutation of the integers 1..n,

• longest common increasing sequence.

LCS dynamic programming formula:

Li,0 = 0
L0,j = 0
Li,j = if (Ai = Bj) then Li–1,j–1 + 1
          else max(Li–1,j, Li,j–1)

Page 24: Łódź, 2008


LCS length calculation via dynamic programming

[http://www-igm.univ-mlv.fr/~lecroq/seqcomp/node4.html#SECTION004]

Page 25: Łódź, 2008


LCS, Python code

s1, s2 = "tigers", "trigger"

# prev holds the previous row of the LCS length matrix
prev = [0] * (len(s1) + 1)
print(prev)
for ch in s2:
    curr = [0] * (len(s1) + 1)
    for c in range(len(s1)):
        curr[c+1] = max(prev[c+1], curr[c]) if ch != s1[c] else prev[c] + 1
    prev = curr
    print(prev)

Page 26: Łódź, 2008

Comparing code versions; highlighted lines – common to both versions

[Figure: two versions of a code fragment, with the lines common to both highlighted.]

LCS(source_left, source_right) = 8

Page 27: Łódź, 2008


LCS, anything better than plain DP?

The basic dyn. programming is clearly O(mn) in the worst case.

Surprisingly, we can’t beat this result significantlyin the worst case.

The best practical idea for the worst case is a bit-parallel algorithm (there are a few variants) with O(n⌈m/w⌉) time, where w is the machine word length (and a few times faster than the plain DP in practice).

Still, we also have algorithms with output-dependentcomplexities, e.g., the Hunt–Szymanski (1977) one

with O(r log m) worst case time,where r is the number of matching cells in the DP matrix

(that is, r may be as large as mn in the worst case).

Page 28: Łódź, 2008


Text indexing

If many searches are expected to be run over a text (e.g., a manual, a collection of journal papers), it is worth sacrificing space and preprocessing time to build an index over the text supporting fast searches.

A full-text index: a match at any position of T can be found through it.

Not all text indexes are full-text ones. For example, word-based indexes will find P's occurrences in T only at word boundaries. (Quite enough in many cases, less space-consuming, and often more flexible in some ways.)

Page 29: Łódź, 2008


Suffix tree (Weiner, 1973)

The Lord of the Strings

Suffix tree ST(T) is basically a Patricia triecontaining all n suffixes of T.

Space: Θ(n) words = Θ(n log n) bits (but with a large constant).

Construction time: O(n log σ) in the worst case, or O(n) with high probability (classic ideas), or O(n) in the worst case (Farach, 1997), for an integer alphabet and σ = O(n).

Search time: O(m log σ + occ) (occ – the number of occurrences of P in T).

Page 30: Łódź, 2008


Suffix tree example

http://en.wikipedia.org/wiki/Image:Suffix_tree_BANANA.svg

Page 31: Łódź, 2008


Suffix tree, pros and cons

+ excellent search complexity,

+ good search speed in practice,

+ some advanced queries can be handled with ST easily too,

- lots of space: about 21n bytes (incl. 1n for the text) in the worst case, even in the best implementations (about 11n on average in the Kurtz implementation),

- construction algorithms quite complicated.

Page 32: Łódź, 2008


Suffix array (Manber & Myers, 1990)

A surprisingly simple (yet efficient) idea.

Sort all the suffixes of T, store their indexesin a plain array (n indexes, each 4 bytes typically).

Keep T as well (total space occupancy: 4n+1n = 5n bytes,

much less than with a suffix tree).

Search for P: compare P against the median suffix (that is: read the median suffix index, then refer to the original T). If not found, go left or right, depending on the comparison result, each time halving the range of suffixes. So, this is binary search based.

Page 33: Łódź, 2008

SA example: T = abracadabra
(http://en.wikipedia.org/wiki/Suffix_array)

We could have a $ terminator after T, actually...

Page 34: Łódź, 2008

SA example, cont'd. Now, sort the suffixes lexicographically:

SA(T) = {11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3}

Page 35: Łódź, 2008

abracadabra example, suffix array in Python

s = "abracadabra"
offsets = list(range(1, len(s) + 1))
offsets.sort(key=lambda a: s[a-1:])   # sort the 1-based offsets by their suffixes
print(offsets)

[11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]

Or a shorter code:

s = "abracadabra"
print(sorted(range(1, len(s) + 1), key=lambda a: s[a-1:]))

Page 36: Łódź, 2008


SA search properties

The basic search mechanism is that every pattern occurring in text T must be a prefix of some suffix of T.

Worst case search time: O(m log n + occ).

But in practice it is closer to O(m + log n + occ).

SA: very simple, very practical,very inspiring.
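A sketch of the binary search over the example SA from the previous slides (sa_search is a hypothetical helper; each inspected suffix is clipped to |P| characters, so the two binary searches delimit exactly the suffixes having P as a prefix):

def sa_search(T, SA, P):
    def pref(i):                     # the i-th smallest suffix, clipped to |P| chars
        return T[SA[i]-1 : SA[i]-1+len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                   # lower bound: first suffix with prefix >= P
        mid = (lo + hi) // 2
        if pref(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(SA)
    while lo < hi:                   # upper bound: first suffix with prefix > P
        mid = (lo + hi) // 2
        if pref(mid) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[i] for i in range(first, lo))   # 1-based occurrence positions

T = "abracadabra"
SA = [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(sa_search(T, SA, "abra"))      # [1, 8]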

Page 37: Łódź, 2008

How to create the suffix array (= sort the suffixes) efficiently

The classic comparison-based sorting algorithms (e.g., quicksort, merge sort) are no good for sorting suffixes.

They are quite slow in typical cases, and extremely slow (needing e.g. O(n² log n) time, since a single suffix comparison may cost O(n)) in pathological cases; a pathological case may be e.g. an extremely long repetition of the same short pattern (abcabcabc...abc – a few million times), or a concatenation of two copies of the same book.

Page 38: Łódź, 2008


Better solutions

There are O(n) worst-case time algorithms for building the suffix TREE. It is then easy to obtain the suffix array from the suffix tree in O(n) time. But this approach is not practical.

Some other choices:

Manber–Myers (1990), O(n log n) (but slow in practice); see the prefix-doubling sketch below

Larsson–Sadakane (1999), O(n log n) (quite practical; used as one of the sorting components in the bzip2 compressor)

Kärkkäinen–Sanders (2003), O(n) directly (not via ST building)

Puglisi–Maniscalco (2006), fast in practice.
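For illustration, a compact prefix-doubling construction in the Manber–Myers spirit (a sketch, not any of the cited implementations; with Python's comparison sort it runs in O(n log² n) rather than O(n log n)):

def suffix_array(s):
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]             # rank suffixes by their first char
    k = 1
    while True:
        # sort by the pair (rank of first k chars, rank of the next k chars)
        key = lambda i: (rank[i], rank[i+k] if i + k < n else -1)
        sa.sort(key=key)
        new = [0] * n
        for t in range(1, n):
            new[sa[t]] = new[sa[t-1]] + (key(sa[t]) != key(sa[t-1]))
        rank = new
        if rank[sa[-1]] == n - 1:          # all ranks distinct: fully sorted
            break
        k *= 2
    return [i + 1 for i in sa]             # 1-based, as on the example slides

print(suffix_array("abracadabra"))         # [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]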