Searching strings using the waves An efficient index structure for string databases Ingmar Brouns Jacob Kleerekoper


Searching strings using the waves

An efficient index structure for string databases

Ingmar Brouns, Jacob Kleerekoper

Overview

Introduction
A lower bound on the edit distance
What is a wavelet?
A refinement of the lower bound
MRS index
Searching using the MRS index
Questions

String searching

Searching in a string database S = {s1, …, sd}

Range search
Find all the substrings of S within a distance r to a search string q
The error rate is denoted by ε = r / |q|

K-nearest neighbor search
Find the k closest substrings of S to q

Question from Adriano

What is the query range r in ε = r / |q|?

When performing a range search, r is the maximum edit distance that a substring of the database may have to the query string q and still be a valid result

f(s) is a frequency vector

Σ is an alphabet of σ characters
s is a string from that alphabet Σ
f(s) = [v1, …, vσ], where vi = frequency of the i-th letter of Σ in s

Sum of v1, …, vσ = length of s

Example of f(s)

Σ = {A, C, G, T}
s = CTACATCGATCGATCAG
#A = 5, #C = 5, #G = 3, #T = 4
f(s) = [5, 5, 3, 4]

Sum of v1, …, vσ = length of s
5 + 5 + 3 + 4 = 17
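The frequency vector from this example can be computed with a few lines of Python. This is only a minimal sketch: the function name `frequency_vector` and the default alphabet order "ACGT" are assumptions made for illustration.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Return f(s): how often each letter of the alphabet occurs in s."""
    counts = Counter(s)
    return [counts[letter] for letter in alphabet]


s = "CTACATCGATCGATCAG"
f = frequency_vector(s)
print(f)                 # [5, 5, 3, 4]
print(sum(f) == len(s))  # True
```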

Question from Laurence

“As a result of lemma 1, the transformation of a string of length n lie on the σ-1 dimensional plane that passes through the point [n, 0, …, 0] and is perpendicular to the normal vector [1, …, 1]”

Why is that the case?

Answer to Laurence

Take a string s with |s| = 4 and an alphabet Σ = {A, B, C}
This string might be AAAA, BBBB, CCCC or any of the 78 other strings of length 4 over Σ
The f(s) corresponding to AAAA, BBBB and CCCC are [4, 0, 0], [0, 4, 0] and [0, 0, 4]

All the f(s) lie on the same 2D plane in the 3D space with equation v1 + v2 + v3 = 4, which has the normal vector [1, 1, 1]

This holds in general for every alphabet size σ and string length n, since the plane v1 + … + vσ = n always has the normal vector [1, …, 1] and contains the point [n, 0, …, 0]

Edit operations on s

f(s) = [v1, ..., vσ]

Insert: vi := vi + 1
Delete: vi := vi - 1
Replace: vi := vi + 1 and vj := vj - 1, with i ≠ j

The σ-dimensional space

The space in which all the possible points [v1, ..., vσ] exist

Take u and v points in the σ-dimensional space

We call u and v neighbors if you can obtain u from v using one edit operation

Frequency distance FD1

Take u and v frequency vectors of two strings of the same alphabet (points in the σ-dimensional space)

The frequency distance FD1(u, v) between u and v is the minimum number of steps to get from u to v by jumping each step to a neighbor point

Frequency distance vs. edit distance

The edit distance ED(s1, s2) of strings s1 and s2 is the minimal number of edit operations to get from s1 to s2

FD1(f(s1), f(s2)) ≤ ED(s1, s2)

s1 = AC, s2 = CA, f(s1) = [1,1,0,0], f(s2) = [1,1,0,0]

FD1 = 0

ED = 2 (two replaces or one insert and one delete)
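The edit distance in the AC / CA example can be checked with the standard dynamic-programming algorithm. The sketch below is a textbook version, not code from the article; the name `edit_distance` is chosen here for illustration, and all three edit operations are assumed to cost 1.

```python
def edit_distance(s1, s2):
    """Minimal number of inserts, deletes and replaces turning s1 into s2."""
    prev = list(range(len(s2) + 1))        # row for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                         # delete the first i letters of s1
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # replace c1 by c2 (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AC", "CA"))  # 2
```

Both strings contain exactly one A and one C, so their frequency vectors are identical while the edit distance is 2.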

Frequency distance vs. edit distance

FD1(f(s1), f(s2)) ≤ ED(s1, s2)

Proof:

A single insert or delete in ED corresponds to rule 1 or 2 in FD1, so ED as well as FD1 change by 1

An insert combined with a delete, or a replace, in ED is covered in FD1 by a single use of rule 3: vi := vi + 1 and vj := vj – 1

So FD1 never needs more steps than ED needs edit operations, and it can need fewer, hence the ≤ sign

So FD1 is a lower bound on ED

The lower bound of ED

Take q and s strings from alphabet Σ, r is the range (maximum ED in range search)

if r < FD1(f(q), f(s)) then r < ED(q, s)

Computing ED costs O(nm) time for strings of lengths n and m, but FD1 costs only O(σ)

Computing FD1

Take frequency vectors u and v of two strings of the same alphabet Σ

We collect total positive distance (pos) and total negative distance (neg)

for every letter i in Σ
  if ui > vi we add the difference ui – vi to pos
  otherwise we add vi – ui to neg

return the maximum of pos and neg

How does it work?

u = [2, 10]
v = [8, 1]

u1 < v1, so add 8 - 2 = 6 to neg
u2 > v2, so add 10 - 1 = 9 to pos
Now pos = 9, neg = 6, so return 9

This corresponds to 6 replaces and 3 inserts
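The FD1 procedure and the worked example with u = [2, 10] and v = [8, 1] translate directly into code. This is a minimal sketch; the function name `fd1` is an assumption, and the vectors are plain Python lists.

```python
def fd1(u, v):
    """Frequency distance between two frequency vectors of equal dimension."""
    pos = 0  # total amount by which u exceeds v
    neg = 0  # total amount by which v exceeds u
    for ui, vi in zip(u, v):
        if ui > vi:
            pos += ui - vi
        else:
            neg += vi - ui
    return max(pos, neg)


print(fd1([2, 10], [8, 1]))            # 9
print(fd1([1, 1, 0, 0], [1, 1, 0, 0])) # 0
```

The loop touches each of the σ entries once, which matches the O(σ) cost stated for FD1.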

Improving the lower bound

We've established a lower bound on the edit distance, namely the frequency distance

But we can improve this lower bound by incorporating more information than how often letters occur. We would like to have more information about where in the string they occur.

Wavelets

Problems with Fourier transform
Representation of frequencies in signal
But we do not know when these frequencies occur

Wavelet transform
Shows time & frequencies
Used in all sorts of signal processing (compression), e.g. JPEG2000

How does this affect us?

Suppose we have some signal
We can encode this signal by recursively taking the average of a part of the signal, and then the difference between the averages of the two halves of this part.

Wavelets

Now with strings: AT

Frequency vector (average) = [1,0,0,1]
Detail = [1,0,0,0] – [0,0,0,1] = [1,0,0,-1]
Note that we know by the detail that the first half was an A and the second half was a T.

Wavelets (Adriano)

TCACTTAG
TCAC TTAG
TC AC TT AG
T C A C T T A G

Letters T, C, A, C, T, T, A, G: [0,0,0,1] [0,1,0,0] [1,0,0,0] [0,1,0,0] [0,0,0,1] [0,0,0,1] [1,0,0,0] [0,0,1,0]

Wavelets

TCACTTAG
TCAC TTAG
TC AC TT AG
T C A C T T A G

Letters T, C, A, C, T, T, A, G: [0,0,0,1] [0,1,0,0] [1,0,0,0] [0,1,0,0] [0,0,0,1] [0,0,0,1] [1,0,0,0] [0,0,1,0]
Pairs TC, AC, TT, AG: [0,1,0,1] [1,1,0,0] [0,0,0,2] [1,0,1,0]

TCACTTAG
TCAC TTAG
TC AC TT AG
T C A C T T A G

Letters T, C, A, C, T, T, A, G: [0,0,0,1] [0,1,0,0] [1,0,0,0] [0,1,0,0] [0,0,0,1] [0,0,0,1] [1,0,0,0] [0,0,1,0]
Pairs TC, AC, TT, AG: [0,1,0,1] [1,1,0,0] [0,0,0,2] [1,0,1,0]
Halves TCAC, TTAG: [1,2,0,1] [1,0,1,2]
Whole string TCACTTAG: [2,2,1,3]

TCACTTAG
TCAC TTAG
TC AC TT AG
T C A C T T A G

Letters T, C, A, C, T, T, A, G: [0,0,0,1] [0,1,0,0] [1,0,0,0] [0,1,0,0] [0,0,0,1] [0,0,0,1] [1,0,0,0] [0,0,1,0]
Pairs TC, AC, TT, AG: [0,1,0,1] [1,1,0,0] [0,0,0,2] [1,0,1,0]
Halves TCAC, TTAG: [1,2,0,1] [1,0,1,2]
Whole string TCACTTAG: [2,2,1,3]

Detail of the pair TC: [0,0,0,1] - [0,1,0,0] = [0,-1,0,1]

TCACTTAG
TCAC TTAG
TC AC TT AG
T C A C T T A G

Letters T, C, A, C, T, T, A, G: [0,0,0,1] [0,1,0,0] [1,0,0,0] [0,1,0,0] [0,0,0,1] [0,0,0,1] [1,0,0,0] [0,0,1,0]
Pairs TC, AC, TT, AG: [0,1,0,1] [1,1,0,0] [0,0,0,2] [1,0,1,0]
Halves TCAC, TTAG: [1,2,0,1] [1,0,1,2]
Whole string TCACTTAG: [2,2,1,3]

Detail of the whole string: [0,2,-1,-1]
Details of the halves TCAC, TTAG: [-1,0,0,1] [-1,0,-1,2]
Details of the pairs TC, AC, TT, AG: [0,-1,0,1] [1,-1,0,0] [0,0,0,0] [1,0,-1,0]

TCACTTAG
TCAC TTAG
TC AC TT AG
T C A C T T A G

Sum of the whole string: [2,2,1,3]
Detail of the whole string: [0,2,-1,-1]
Details of the halves TCAC, TTAG: [-1,0,0,1] [-1,0,-1,2]
Details of the pairs TC, AC, TT, AG: [0,-1,0,1] [1,-1,0,0] [0,0,0,0] [1,0,-1,0]
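The decomposition of TCACTTAG shown on these slides can be reproduced with a short script. This is a sketch under assumptions: the names `letter_vector` and `wavelet_levels` are made up for illustration, the alphabet order is A, C, G, T, the string length must be a power of two, and level 0 carries no detail vector here (stored as `None`).

```python
def letter_vector(c, alphabet="ACGT"):
    """Frequency vector of a single letter."""
    return [1 if c == a else 0 for a in alphabet]


def wavelet_levels(s, alphabet="ACGT"):
    """Return a list of levels; level k holds one (sum, detail) pair
    for every block of 2**k consecutive letters of s."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes len(s) is a power of two")
    sums = [letter_vector(c, alphabet) for c in s]
    levels = [[(a, None) for a in sums]]  # level 0: the letters themselves
    while len(sums) > 1:
        level = []
        for left, right in zip(sums[0::2], sums[1::2]):
            total = [x + y for x, y in zip(left, right)]   # sum of both halves
            detail = [x - y for x, y in zip(left, right)]  # first half minus second half
            level.append((total, detail))
        levels.append(level)
        sums = [total for total, _ in level]
    return levels


levels = wavelet_levels("TCACTTAG")
print(levels[3][0])  # ([2, 2, 1, 3], [0, 2, -1, -1])
```

Each level halves the number of blocks, so the last level holds a single pair: the frequency vector of the whole string and the difference between its two halves.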

The kth wavelet transformation

Definition 4
Let s = c0, …, cn-1 be a string from the alphabet {α1, …, ασ}; then the kth-level wavelet transformation ψk(s), 0 ≤ k ≤ log2 n, of s is defined as:

ψk(s) = [vk,0, …, vk,n/(2^k)-1] where vk,i = [Ak,i, Bk,i]

The 0th wavelet transformation

The 0th wavelet transformation defines the original string

ψ0(s) = [v0,0, …, v0,n-1] where v0,i = [A0,i, B0,i]
For TCACTTAG that is v0,0, …, v0,7 with A0,0, …, A0,7

The log2n wavelet transformation

In the article they chose to use only the first and second wavelet coefficient; this corresponds to the log2n wavelet transformation.

ψk(s) = [vk,0, …, vk,n/(2^k)-1], so only vlog2n,0
For TCACTTAG that is v3,0
A3,0 = A2,0 + A2,1
A2,0 = A1,0 + A1,1, A2,1 = A1,2 + A1,3
A1,0 = A0,0 + A0,1, A1,1 = A0,2 + A0,3, etc.
A3,0 = [2,2,1,3]
B3,0 = A2,0 – A2,1 = [0,2,-1,-1]

Theorem 3 (Bogdan)

String s with coefficients [A, B], where A = [a1, …, aσ] and B = [b1, …, bσ]
How can an edit operation influence A and B?

Replace in first half & second half
ai := ai + 1, aj := aj - 1, bi := bi + 1, bj := bj - 1
ai := ai + 1, aj := aj - 1, bi := bi - 1, bj := bj + 1

Delete & insert
ai := ai ± 1, bj := bj ± 1

Theorem 3

Delete on string of even length
ai := ai ± 1, bi := bi - 1, bj := bj + 2
AABA: A = [3,1], B = [1,-1]
ABA: A = [2,1], B = [0,1]

Insert on string of odd length
ai := ai ± 1, bi := bi + 1, bj := bj - 2
ABA: A = [2,1], B = [0,1]
AABA: A = [3,1], B = [1,-1]
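The AABA / ABA example can be verified numerically. The sketch below computes only the first and second coefficient [A, B] of a string. The name `first_second_coefficients` is hypothetical, and the rule that a string of odd length puts its extra letter in the first half is an assumption inferred from the numbers on this slide, not a statement from the article.

```python
def first_second_coefficients(s, alphabet):
    """Return (A, B): A is the frequency vector of s, B is the frequency
    vector of the first half minus that of the second half."""
    half = (len(s) + 1) // 2  # assumption: odd lengths favour the first half
    first = [s[:half].count(a) for a in alphabet]
    second = [s[half:].count(a) for a in alphabet]
    A = [x + y for x, y in zip(first, second)]
    B = [x - y for x, y in zip(first, second)]
    return A, B


print(first_second_coefficients("AABA", "AB"))  # ([3, 1], [1, -1])
print(first_second_coefficients("ABA", "AB"))   # ([2, 1], [0, 1])
```

Deleting the first A of AABA lowers the first entry of A and of B by 1 and raises the second entry of B by 2, because the B in the string moves from the second half to the first half.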

The lower bound

So if we have ψ(si) and ψ(sj), the five steps listed at Theorem 3 can be used to walk from ψ(si) to ψ(sj). (These are two points in 2σ-dimensional space.)

So now FD2(ψ(si), ψ(sj)) is the length of the shortest legal path using these steps from ψ(si) to ψ(sj)

The MRS index structure

A table of trees Ti,j with index structure

A column stands for string sj of database S = {s1, ..., sd} with 1 ≤ j ≤ d

A row stands for a resolution (or window-size) 2^i with a ≤ i ≤ a + l - 1 and l the number of resolution-levels in the index

Each tree Ti,j consists of several Minimum Bounding Rectangles (MBRs), each containing several wavelet coefficients depending on the given capacity c of the MBR

Building the MRS index (1)

Take string s1 = CTAGTCGA

Let's build the tree T2,1, given c = 3
window-size w = 4 (= 2^i)
string-number j = 1

Take a window of size w and slide along s1

The first MBR contains the 1st and 2nd coefficient of the first c substrings in the window: {φ(CTAG), φ(TAGT), φ(AGTC)}
next MBR: {φ(GTCG), φ(TCGA)}

Building the MRS index (2)

s1 = CTAGTCGA, c = 3

So T2,1 = {φ(CTAG), φ(TAGT), φ(AGTC)}, {φ(GTCG), φ(TCGA)}

Next T3,1 = {φ(CTAGTCGA)} and consists of only 1 MBR

Normally |sj| is much larger than 2^(a + l - 1) (the maximum resolution)

Et cetera
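The construction of T2,1 for s1 = CTAGTCGA can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the names `coefficients` and `build_mbrs` are hypothetical, each substring is mapped to its first and second coefficient concatenated into one point of 2σ dimensions, and an MBR is stored as a dictionary holding its lower and upper corner plus the start position of its first substring.

```python
def coefficients(s, alphabet="ACGT"):
    """First and second wavelet coefficient of s as one 2σ-dimensional point.
    Assumes len(s) is even, as it is for windows of size 2**i."""
    half = len(s) // 2
    first = [s[:half].count(a) for a in alphabet]
    second = [s[half:].count(a) for a in alphabet]
    A = [x + y for x, y in zip(first, second)]
    B = [x - y for x, y in zip(first, second)]
    return A + B


def build_mbrs(s, w, c, alphabet="ACGT"):
    """Slide a window of size w along s and pack every c consecutive
    points into one minimum bounding rectangle."""
    points = [coefficients(s[i:i + w], alphabet) for i in range(len(s) - w + 1)]
    mbrs = []
    for start in range(0, len(points), c):
        group = points[start:start + c]
        mbrs.append({
            "start": start,                          # position of the first substring
            "count": len(group),                     # number of substrings in the box
            "lo": [min(col) for col in zip(*group)], # lower corner
            "hi": [max(col) for col in zip(*group)], # upper corner
        })
    return mbrs


for mbr in build_mbrs("CTAGTCGA", w=4, c=3):
    print(mbr["start"], mbr["count"])
# 0 3
# 3 2
```

With w = 4 there are five windows (CTAG, TAGT, AGTC, GTCG, TCGA), so a capacity of c = 3 gives one full box and one box with two points, as on the slide.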

Some remarks on the index structure

Search for a subquery of length 2^i?
Just take row Ri = {Ti,1, ..., Ti,d} of the table

Take a query string q and an MBR B:
FD(q, B) is the minimum of all the FD(q, s) where s ∈ B, so
if r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B

Wavelet coefficients of substrings obtained by sliding the window are very close to each other, so the set of coefficients in an MBR is highly clustered.


Range Queries

We are searching for all sequences that are within an edit distance of r from the query string

Easy case: the index contains a resolution that exactly fits the size of the query string

Range queries

Query string has length 2^a

For the corresponding row in the index, for every database sequence, we compute the FD of the query string to the MBRs.

If r ≤ FD(q,B) then r ≤ FD(q,s) for all s ∈ B
If r < FD(q,B) then r < ED(q,s) for every s ∈ B
So if FD(q,B) > r, then we drop B
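One way to obtain a distance to a box without looking at the strings inside it is to compare the query vector with the box corners only. The sketch below does this for FD1 on plain frequency vectors. It is an assumption made for illustration and is not taken from the article: the name `fd1_to_box` is hypothetical, and the value returned is a lower bound on FD1(q, s) for every point s inside the box, which may be smaller than the true minimum.

```python
def fd1_to_box(q, lo, hi):
    """Lower bound on FD1(q, s) for every vector s with lo <= s <= hi."""
    # q exceeds any s in the box by at least q_i - hi_i in dimension i
    pos = sum(max(0, qi - h) for qi, h in zip(q, hi))
    # any s in the box exceeds q by at least lo_i - q_i in dimension i
    neg = sum(max(0, l - qi) for qi, l in zip(q, lo))
    return max(pos, neg)


q = [4, 0, 0, 0]                  # frequency vector of AAAA
lo, hi = [1, 0, 1, 1], [1, 1, 1, 2]
print(fd1_to_box(q, lo, hi))      # 3

r = 2
print(fd1_to_box(q, lo, hi) > r)  # True, so this box can be dropped
```

Because the value never exceeds the distance to any point in the box, dropping a box when it is larger than r cannot lose a valid result.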

Range queries

However, we may have some false positives.
If r ≥ FD(q,B), this does not guarantee that r ≥ ED(q,s) for every s ∈ B
That's why we have to post-process all strings that are in the boxes for which r ≥ FD(q,B) (e.g. by dynamic programming)
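The filter-then-verify idea can be illustrated on a small scale. The sketch below is a simplification and not the MRS search itself: it works on whole candidate strings instead of boxes of substrings, uses FD1 on frequency vectors as the filter, and all function names are chosen here for illustration.

```python
def frequency_vector(s, alphabet="ACGT"):
    return [s.count(a) for a in alphabet]


def fd1(u, v):
    pos = sum(ui - vi for ui, vi in zip(u, v) if ui > vi)
    neg = sum(vi - ui for ui, vi in zip(u, v) if vi > ui)
    return max(pos, neg)


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def range_search(q, candidates, r):
    """Return the candidates within edit distance r of q."""
    fq = frequency_vector(q)
    results = []
    for s in candidates:
        if fd1(fq, frequency_vector(s)) > r:
            continue                   # safe to drop: FD1 is a lower bound on ED
        if edit_distance(q, s) <= r:   # post-processing removes false positives
            results.append(s)
    return results


print(range_search("ACGT", ["ACGT", "TGCA", "AAAA", "ACGA"], r=1))
# ['ACGT', 'ACGA']
```

Here AAAA is dropped by the cheap filter alone, while TGCA passes the filter (its frequency vector equals that of the query) and is only rejected by the dynamic-programming check.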

Range queries

Now what if there is no row with a resolution corresponding to the length of the query string?

We partition the query string
We take the longest possible suffix such that the resolution exists in the index
We continue doing this iteratively, so we get q1, q2, …, qt
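The partitioning step can be sketched as a greedy loop. This follows the wording of the slide (longest suffix first) and makes several assumptions: the name `partition_query` is hypothetical, the index is taken to hold the resolutions 2^a up to 2^(a + l - 1), and a remainder shorter than the smallest resolution is simply returned, since the slide does not say how it is handled.

```python
def partition_query(q, a, l):
    """Split q into pieces whose lengths are resolutions 2**a .. 2**(a + l - 1).
    Returns the pieces in the order they were cut off and the unused remainder."""
    pieces = []
    rest = q
    while len(rest) >= 2 ** a:
        i = a + l - 1
        while 2 ** i > len(rest):  # largest resolution that still fits
            i -= 1
        pieces.append(rest[-2 ** i:])
        rest = rest[:-2 ** i]
    return pieces, rest


print(partition_query("ABCDEFGHIJKLM", a=1, l=3))
# (['FGHIJKLM', 'BCDE'], 'A')
```

With resolutions 2, 4 and 8, a query of length 13 is cut into a piece of length 8 and a piece of length 4, leaving one letter over.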


Range queries


Nearest neighbour queries

Given some query we search for the k closest substrings in the database.

Phase 1
Look up the set of k closest MBRs to the query
r1 is the kth smallest edit distance to strings in the set

Phase 2
RangeSearch(q, r1)
Return the k closest strings

Why phase 2?
FDbox10 ≤ FDbox11, FD10 ≤ ED10, FD11 ≤ ED11
However this does not guarantee that ED10 ≤ ED11

Questions (Peter)

It is nice that they can prove that the MRS index does not incur any false drops (Theorem 4), but is this also true in a practical sense?

If r ≤ FD(q,B) then r ≤ FD(q,s) for all s ∈ B
If r < FD(q,B) then r < ED(q,s) for every s ∈ B
So if FD(q,B) > r, then we drop B


Questions (Lee)

The article focuses on substring matching. What adaptations would we need for whole matching?

Determine [A,B] for every entire string in the DB

Determine FD of q to each string

Questions (Bogdan)

The definition of FD(q,B) in section 3.3 says that the distance between a query transformation and a box is the minimum of the distances between the query transformation and the transformations in that box. It is also mentioned in the same section that for each box (MBR) only the lower/higher end points and the starting location of the first substring contained in that MBR are stored as part of the index.

Further on, in section 4.4, a part of the range query algorithm implies the computation of FD(q,B) for various (query, MBR) pairs. However, since we only have the lower/higher end points for each MBR, how is it possible to compute FD(q,B) without retrieving all the substrings s_i that are in the box B from the disk?

I could think of alternatively defining FD(q,B) with a formula involving only the lower/higher end points of the box B, but this is not what the authors are suggesting/using.

Questions (Bogdan)

Questions from Marjolijn

I read the article several times, and I don't understand wavelet coefficients. What does this mean and how can it be used for string comparison?

Do the wavelet coefficients depend on the data itself or only on the frequency of the appearance of the data?

Probably clear by Ingmar's explanation

Questions from Marjolijn

The article says that they use the edit distance. But they also mention the weighted edit distance; why don't they use that one? Does it take more calculation time?

In the FD you don't know anymore whether there was a delete + insert or a replace.

Questions from Marjolijn

Is the algorithm with the edit distance useful for data like DNA where we know for sure that several changes depend on each other and occur more often than other ones?

Same answer as previous slide
No, not if you take these special occasions into account