Indexing Text with Approximate q-grams

Adriano Galati & Marjolijn Elsinga

Overview

• Approximate string matching
  - Neighborhood generation
  - Reduction to exact searching
  - Intermediate partitioning
• Indexing text using q-grams
• Filtration condition
• Finding approximate q-grams
  - Trie data structure
  - Non-deterministic automaton (NFA)
• Parameters

Approximate string matching

Text: $T_{1..n}$

Pattern: $P_{1..m}$

Edit distance: $ed(A, B)$, the minimum number of character insertions, deletions, and substitutions needed to turn A into B

Goal: Retrieve all occurrences of P in T whose edit distance is at most k

Solutions

Many solutions exist; this is one of the most investigated problems in computer science

In on-line versions of the problem the pattern can be preprocessed, but the text cannot

Classical solution: dynamic programming over a matrix, in O(mn) time

Classical solution

Fill a matrix $C_{0..m,0..n}$ where $C_{i,j}$ is the minimum edit distance between $P_{1..i}$ and a suffix of $T_{1..j}$

Initialize the borders with $C_{i,0} = i$ and $C_{0,j} = 0$

Fill internal cells with

$$C_{i,j} = \begin{cases} C_{i-1,j-1} & \text{if } P_i = T_j \\ 1 + \min(C_{i-1,j}, C_{i,j-1}, C_{i-1,j-1}) & \text{otherwise} \end{cases}$$
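The classical solution can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `approximate_search` and the choice to report 1-based end positions are assumptions made here. It keeps only two columns of the matrix and reports position j whenever the last cell of column j is at most k.

```python
def approximate_search(pattern, text, k):
    """Return the 1-based end positions j such that some substring of text
    ending at j is within edit distance k of pattern."""
    m = len(pattern)
    prev = list(range(m + 1))  # column 0: C[i][0] = i
    ends = []
    for j, t in enumerate(text, start=1):
        cur = [0] * (m + 1)  # C[0][j] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == t:
                cur[i] = prev[i - 1]
            else:
                # prev[i] = C[i][j-1], cur[i-1] = C[i-1][j], prev[i-1] = C[i-1][j-1]
                cur[i] = 1 + min(prev[i], cur[i - 1], prev[i - 1])
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


print(approximate_search("abc", "xxabcxx", 1))
```

Every text character costs O(m) work, which gives the O(mn) total mentioned for the classical solution.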

Solutions (2)

If the text is large, on-line algorithms are not practical and preprocessing the text becomes necessary

Focus: sequence-retrieving indexes, with no restrictions on the patterns and the occurrences

Approaches:
• Neighborhood generation
• Reduction to exact searching
• Intermediate partitioning

Neighborhood Generation

The set of strings matching a pattern P with at most k errors, $U_k(P)$, is finite

Therefore it can be enumerated

Each string in $U_k(P)$ can be searched using a data structure

This structure is designed for exact matching
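To make the enumeration concrete, here is a small sketch that generates $U_k(P)$ by applying single edits k times. It is an illustration only: the function name `neighborhood` and the explicit `alphabet` parameter are assumptions, and a real index would search the generated strings in a suffix tree or similar structure rather than collect them in a set.

```python
def neighborhood(pattern, k, alphabet):
    """Return all strings over `alphabet` within edit distance k of pattern."""
    result = {pattern}
    frontier = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])  # substitution
        frontier = nxt - result  # only strings not seen before need expanding
        result |= nxt
    return result


print(sorted(neighborhood("ab", 1, "ab")))
```

The size of the set grows quickly with k and with the pattern length, which is why the approach is attractive mainly for short patterns.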

Neighborhood Generation (2)

+ O(n) space and construction time
- Not optimized for secondary memory
- Inefficient in space requirements

Promising only for searching short patterns

Reduction to Exact Searching

Indexes based on filters

A filter checks for a simpler condition than the matching condition, discarding large parts of the text

Main principle: if two strings A and B match with k errors and k+s non-overlapping samples are extracted from A, then at least s of these must appear without errors in B
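The principle with s = 1 can be sketched as follows: split the pattern into k+1 non-overlapping pieces and look each piece up exactly; any occurrence with at most k errors must contain one of them unchanged. The function names `split_pieces` and `exact_filter` are assumptions for this illustration, and `str.find` stands in for the exact-matching index.

```python
def split_pieces(pattern, pieces):
    """Split pattern into contiguous, non-overlapping parts of near-equal length.
    Returns (offset, piece) pairs."""
    m = len(pattern)
    bounds = [i * m // pieces for i in range(pieces + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def exact_filter(pattern, text, k):
    """Return candidate start positions of the pattern in text.
    Each candidate still has to be verified, e.g. with dynamic programming."""
    if len(pattern) < k + 1:
        raise ValueError("pattern too short to cut into k+1 non-empty pieces")
    candidates = set()
    for offset, piece in split_pieces(pattern, k + 1):
        pos = text.find(piece)
        while pos != -1:
            candidates.add(pos - offset)  # where the whole pattern would start
            pos = text.find(piece, pos + 1)
    return sorted(candidates)


print(exact_filter("abcdef", "xxabcxefxx", 1))
```

Only the text around the returned positions needs to be checked with the full matching condition; everything else is discarded by the filter.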

Reduction to Exact Searching (2)

+ Can be built in linear time and needs O(n) space

+ With some methods it is possible to build an index that takes less space than the text itself

- Based on suffix trees or on indexing all the q-grams

Intermediate Partitioning

Reduces the search to approximate search instead of exact search

Main principle: if two strings A and B match with at most k errors and j disjoint substrings are taken from A, then at least one of these appears in B with at most $\lfloor k/j \rfloor$ errors

Split the pattern in j pieces, search each piece in the index allowing $\lfloor k/j \rfloor$ errors, extend the approximate matches to complete occurrences

Question (Ingmar)

I think the main principle is incorrect. Consider

AAABBBBBB
BBBBBBBBB

These match with k=3 errors. If we take the disjoint substrings AAA BBB BBB, then j=3. Now they say that one of these will appear in the other with $\lfloor 3/3 \rfloor = 1$ errors. However, AAA matches with 3 errors, BBB with 0 and BBB with 0

Answer

The pattern is split in j pieces, and each piece is searched in the index allowing $\lfloor k/j \rfloor$ errors

AAA BBB BBB
BBB BBB BBB

We match BBB with ABB but not with AAA or AAB, because with k=3 and j=3 a piece cannot be matched with more than $\lfloor k/j \rfloor = 1$ errors, unless we change the parameters. The principle only requires that at least one piece matches within this bound; here both BBB pieces match with 0 errors
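The example from the question can be checked numerically. The sketch below is not from the slides; `bed` is a helper name chosen here for the smallest edit distance between a piece and any substring of the other string, computed with the same recurrence as the classical solution.

```python
def bed(piece, target):
    """Smallest edit distance between piece and any substring of target."""
    prev = [0] * (len(target) + 1)  # the empty prefix matches anywhere for free
    for i, a in enumerate(piece, start=1):
        cur = [i] + [0] * len(target)
        for l, b in enumerate(target, start=1):
            if a == b:
                cur[l] = prev[l - 1]
            else:
                cur[l] = 1 + min(prev[l], cur[l - 1], prev[l - 1])
        prev = cur
    return min(prev)


k, j = 3, 3
pieces = ["AAA", "BBB", "BBB"]
errors = [bed(p, "BBBBBBBBB") for p in pieces]
print(errors)                 # errors per piece
print(min(errors) <= k // j)  # the principle needs only one piece within the bound
```

The piece AAA needs 3 errors, but the principle is satisfied because the other two pieces stay within $\lfloor k/j \rfloor$.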

Intermediate Partitioning (2)

+ Allows choosing an optimal point between neighborhood generation (worse with longer pieces) and reduction to exact searching (worse with shorter pieces)

Has been used on the pattern but not yet on the text itself

Indexing text using q-grams

Steps:
• Filtering text
• Finding approximate q-grams

Advantages:
• Takes little space
• Has an alternative tradeoff
• User can decide what is important: saving space or better performance

Filtration condition

Based on locating approximate matches of pattern q-grams in text

Leads to a filtration tolerating higher error levels compared to exact q-gram matching

Condition for an approximate match

Two strings A and B with $ed(A, B) \le k$

Let $A = A_1 x_1 A_2 x_2 \dots x_{j-1} A_j$

Now: at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors

Only the q-grams for which this holds will be used for searching

Example: Condition

A: CCTC TCTC CCCT
B: CCCC CTCT TCTC

We see: k=8
We take: j=3

Now $e = \lfloor 8/3 \rfloor = 2$, so at least one $A_i$ appears in B with at most 2 errors

Question (Peter)

“Note that it is possible that $j \lfloor k/j \rfloor < k$, so we are not only ‘distributing’ errors across pieces, but also ‘removing’ some of them”

How does this work?

Answer

$A = A_1 x_1 A_2 x_2 A_3$

k=5

j=3

$e = \lfloor 5/3 \rfloor = 1$, so $j \cdot e = 3 < k = 5$

Q-grams vs. Q-samples

Q-grams overlap
Q-samples do not overlap

String: ABCDEF
Q-grams: {ABC, BCD, CDE, DEF}
Q-samples: {ABC, DEF}

In a q-gram index all the text q-grams are stored in increasing order

In a q-sample index only some text q-grams are stored
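The difference between the two is easy to see in code. This is a minimal sketch with hypothetical helper names `q_grams` and `q_samples`; the sampling step `h` is the distance between the start positions of consecutive q-samples.

```python
def q_grams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def q_samples(s, q, h):
    """Substrings of length q taken every h characters (non-overlapping when h >= q)."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, h)]


print(q_grams("ABCDEF", 3))
print(q_samples("ABCDEF", 3, 3))
```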

Constructing q-samples

We need to extract j pieces from each potential pattern occurrence in the text

So: a q-sample every h text characters

We need to guarantee that j q-samples are inside any occurrence of P

Minimal length of an occurrence of P: m - k

$$h \le \left\lfloor \frac{m - k - q + 1}{j} \right\rfloor$$
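The bound on h can be checked by brute force. In the sketch below the function names and the example values (m=20, k=2, q=3, j=3) are assumptions chosen for illustration; `min_samples_in_window` counts, for every alignment of a window against the sampling grid, how many q-samples lie completely inside it.

```python
def sampling_step(m, k, q, j):
    """Largest step h that still guarantees j q-samples inside any occurrence."""
    h = (m - k - q + 1) // j
    if h < 1:
        raise ValueError("parameters leave no room for j q-samples")
    return h


def min_samples_in_window(window_len, q, h):
    """Fewest complete q-samples in any window of the given length,
    when q-samples start at positions 0, h, 2h, ..."""
    counts = []
    for start in range(h):  # every alignment of the window relative to the grid
        end = start + window_len
        counts.append(sum(1 for s in range(0, end, h)
                          if s >= start and s + q <= end))
    return min(counts)


m, k, q, j = 20, 2, 3, 3
h = sampling_step(m, k, q, j)
print(h, min_samples_in_window(m - k, q, h), min_samples_in_window(m - k, q, h + 1))
```

With the computed step every shortest occurrence holds at least j q-samples; one step larger and some alignment holds fewer.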

Question (Jacob)

Could you please explain how the restriction of h is built up?

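Answer

An occurrence of P has length at least m - k. When one q-sample is taken every h characters, a window of that length fully contains at least $\lfloor (m - k - q + 1)/h \rfloor$ q-samples. Requiring at least j of them gives

$$\left\lfloor \frac{m - k - q + 1}{h} \right\rfloor \ge j \iff h \le \left\lfloor \frac{m - k - q + 1}{j} \right\rfloor$$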

Next step

The best match distance (bed) is calculated for each q-sample of a test sequence: $bed(d, Q)$ is the smallest edit distance between the text q-sample d and any substring of the corresponding pattern block Q

The bed values of a q-sample sequence are summed

The text area around the sequence is only examined if this sum is at most k

Algorithm

Each q-sample sequence has its own counter M

M indicates the number of errors produced by the q-sample sequence and is initialized to $j(e+1)$

So: we start by assuming that each q-sample gives enough errors to disallow a match

Error-environment

After initializing M for each q-sample sequence, we obtain the e-environment of each pattern block

This is the set of text q-samples that appear inside the pattern block with at most e errors

Finishing

Each text area is now connected to the e-environments through its q-samples

The candidate text areas can be checked with dynamic programming
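The counter scheme can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the e-environments are passed in as dictionaries mapping a text q-sample index to its bed value, the update rule (subtract the pessimistic e+1 and add the actual bed) is inferred from the initialization $j(e+1)$, and the function name `candidate_sequences` is hypothetical.

```python
def candidate_sequences(environments, num_samples, j, k, e):
    """environments[i] maps a text q-sample index r to bed(d_r, Q_i) for pattern
    block i (0-based); only q-samples with bed <= e are present.
    Returns the start indices of q-sample sequences whose counter ends at most k."""
    num_sequences = num_samples - j + 1
    # Pessimistic start: every q-sample is assumed to cost e + 1 errors.
    counters = [j * (e + 1)] * num_sequences
    for i, env in enumerate(environments):
        for r, errors in env.items():
            seq = r - i  # the sequence in which q-sample r lines up with block i
            if 0 <= seq < num_sequences:
                counters[seq] -= (e + 1) - errors  # replace the guess by the real cost
    return [seq for seq, m in enumerate(counters) if m <= k]


envs = [{0: 0, 3: 1}, {1: 1}, {2: 0, 5: 0}]  # made-up bed values for j = 3 blocks
print(candidate_sequences(envs, num_samples=6, j=3, k=4, e=1))
```

Only the text areas of the returned sequences would then be verified with dynamic programming.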

Finding approximate q-grams

Finding all the text q-samples that appear inside a given pattern block $Q_i$:

$$I_e^q(Q_i) = \{\, r \in 1..\lfloor n/h \rfloor,\ bed(d_r, Q_i) \le e \,\}$$

Note: it is not necessary to generate all of $U_e^q(Q_i)$, since we are interested only in the text q-samples (positions)

Finding approximate q-grams (2)

Idea: store all the different text q-samples in a trie data structure

We fill in a matrix $C_{0..q,0..|Q|}$ such that $C_{i,\ell}$ is the sed between $S_{1..i}$ and a suffix of $Q_{1..\ell}$

$S$ is relevant if $C_{q,\ell} \le e$ for some $\ell$

In a trie traversal of the q-samples, the characters of $S$ are obtained one by one

Question (Laurence)

Can you please show how the matrix in fig. 4 of section 3 is built? It is a bit unclear to me how the matrix is initialized and how the different cells are filled.

Answer

Initialize the borders with $C_{0,j} = 0$ and $C_{i,0} = i$

$$C_{i,j} = \text{if } S_i = Q_j \text{ then } C_{i-1,j-1} \text{ else } 1 + \min(C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1})$$

Answer

[Figure: example matrices with columns labelled s, u, r, v, e, y, filled row by row along several trie branches]

Finding approximate q-grams (4)

When we reach a leaf node (depth q), we check whether the last row of $C$ contains a cell with value at most $e$; if so, the corresponding text q-sample is reported

Complexity: $O(|Q| q) = O(mq)$

Finding approximate q-grams (3)

Pruning:
• The values never decrease from one row to the next
• If all the values of a row are larger than $e$, we can abandon that branch of the trie at that point
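The trie traversal with the matrix and the pruning rule can be sketched as follows. This is a simplified illustration: the trie is a nested dictionary, the key `None` marks a leaf holding text positions, and the names `build_trie` and `e_environment` are assumptions. One matrix row is computed per trie edge, so q-samples sharing a prefix share the rows for that prefix.

```python
def build_trie(samples):
    """samples: iterable of (text_position, q_sample). Returns a nested-dict trie;
    the key None at a leaf holds the list of text positions of that q-sample."""
    root = {}
    for position, sample in samples:
        node = root
        for ch in sample:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(position)
    return root


def e_environment(trie, block, e):
    """Return {text_position: errors} for every stored q-sample that appears
    inside the pattern block with at most e errors."""
    found = {}

    def visit(node, row):
        if None in node:  # leaf: look for a cell with value <= e in the last row
            best = min(row)
            if best <= e:
                for position in node[None]:
                    found[position] = best
        for ch, child in node.items():
            if ch is None:
                continue
            cur = [row[0] + 1]  # C[i][0] = i
            for l, b in enumerate(block, start=1):
                if ch == b:
                    cur.append(row[l - 1])
                else:
                    cur.append(1 + min(row[l], cur[l - 1], row[l - 1]))
            if min(cur) <= e:  # pruning: otherwise no deeper row can recover
                visit(child, cur)

    visit(trie, [0] * (len(block) + 1))  # C[0][l] = 0
    return found


trie = build_trie([(0, "sur"), (3, "vey"), (6, "sun"), (9, "xyz")])
print(e_environment(trie, "survey", 1))
```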

Finding approximate q-grams (5)

Alternative way:
• Model the search with a non-deterministic automaton (NFA)

Finding approximate q-grams (6)

Consider the NFA for $e = 2$ errors

Every row denotes the number of errors seen

Every column represents matching a prefix of S

Horizontal arrows represent matching a character

All the others increment the number of errors
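The automaton can be simulated directly by tracking the set of active states. The sketch below is an assumption-laden illustration: a state is a pair (prefix length of S matched, errors seen), the simulation is set-based rather than bit-parallel, and the function name `nfa_bed` is hypothetical. The automaton is built for one q-sample S and fed the characters of a pattern block.

```python
def nfa_bed(sample, block, e):
    """Run the NFA for `sample` with at most e errors over `block`.
    Returns the fewest errors of any accepting state reached, or None."""
    q = len(sample)

    def closure(states):
        # Epsilon moves: skip a character of sample without reading input (one error).
        stack = list(states)
        while stack:
            i, c = stack.pop()
            nxt = (i + 1, c + 1)
            if i < q and c < e and nxt not in states:
                states.add(nxt)
                stack.append(nxt)
        return states

    def best(states):
        accepting = [c for i, c in states if i == q]
        return min(accepting) if accepting else None

    active = closure({(0, 0)})
    result = best(active)
    for ch in block:
        nxt = {(0, 0)}  # a match may start at any position of the block
        for i, c in active:
            if i < q and sample[i] == ch:
                nxt.add((i + 1, c))          # horizontal arrow: match
            if c < e:
                nxt.add((i, c + 1))          # extra input character
                if i < q:
                    nxt.add((i + 1, c + 1))  # substitution
        active = closure(nxt)
        b = best(active)
        if b is not None and (result is None or b < result):
            result = b
    return result


print(nfa_bed("sun", "survey", 2), nfa_bed("xyz", "survey", 1))
```

Because the automaton is built for a single q-sample, it has to be run once per q-sample, unlike the trie traversal, which shares work across common prefixes.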

Question (Bogdan)

I can imagine how the trie can be used together with the matrix in order to benefit from common prefixes of certain q-samples (by reusing the rows of the matrix which are already computed for the common prefix). However, I don't see how this can be done in the case of the NFA. If it can't be done, this would mean that the algorithm has to be run separately for each q-tuple, which probably makes the NFA approach much worse. Am I right to think that or is there a way to run the NFA in a "smarter" way so as to benefit from common prefixes?

Answer (Bogdan)

Yes, you are right: the algorithm has to be run for each q-tuple, but you have to consider its complexity, which is linear, $O(e)$

Parameters of the Problem

With a smaller value of e, the search of the e-environment will be cheaper

A larger value of e gives more exact estimates of the actual number of errors, but with a higher cost to search the e-environment

As j grows, longer test sequences with fewer errors per piece are used: the cost to find the relevant q-samples decreases, but the amount of text verification increases

Parameters of the Problem (2)

1. Notice: because the index of this approach only stores non-overlapping q-samples, its space requirement is small

2. Notice: the space consumption of the index depends on the interval h

Parameters of the Problem (3)

A standard q-gram index stores all the locations of all the q-grams of the text

The number of q-grams is $n - q + 1$

Storing a position takes $\log n$ bits, so the space consumption is $n \log n$

Ratio between this method and the standard approach:

$$v_r = \frac{(n/h) \log(n/h)}{n \log n} \approx \frac{1}{h}$$
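A quick calculation shows how close the ratio is to 1/h. The function name `space_ratio` and the example values for n and h are assumptions for illustration; base-2 logarithms are used, although the base cancels in the ratio.

```python
import math


def space_ratio(n, h):
    """Space of a q-sample index (one q-sample every h characters) relative to
    a q-gram index that stores every text position."""
    samples = n / h
    return (samples * math.log2(samples)) / (n * math.log2(n))


n, h = 2 ** 20, 16
print(space_ratio(n, h), 1 / h)
```

The ratio stays slightly below 1/h because each stored q-sample position also needs fewer bits.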

Question (Bogdan)

Could you please explain what the "columns" used in the 5th section are?

The table shows how the error level increases the number of processed columns of matrix or NFA

Question (Lee/Bram)

The article talks about disjoint, non-overlapping q-grams. At the end they say that allowing overlapping q-grams will probably enhance the scheme. Any idea how our current algorithms have to be changed for that, and what the advantages are?

http://www.cs.utexas.edu/users/mobios/MoBIoSPapers/2003-IndexingProteinSequences-TR-04-06.pdf

Question (Lee)

In the second paragraph of section 4 they say “In that particular case we can avoid the use of counters…” Can you explain that ?

Answer

The error counters M are initialized at a high value

After that, all pattern blocks are compared to the corresponding text piece and the counter value is updated to a lower value

In this particular case, when $e = \lfloor k/j \rfloor$, the error counter can get as low as k+1, which is higher than the initial value

Any other questions?