An Efficient Index Structure for String Databases

1

Tamer Kahveci, Ambuj K. Singh

Department of Computer Science, University of California, Santa Barbara

http://www.cs.ucsb.edu/~tamer

2

Whole/Substring Matching Problem

Quickly find the substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size).

[Figure: a query string shown against a database string]

3

String Similarity

Motivation: Applications

Genetic sequence databases, NCBI

Text databases, spell checkers, web search.

Video databases (e.g. VIRAGE, MEDIA360)

Databases are too large for most of the available techniques, which are in-memory.

Space requirement of current indexes is too large.

[Figure: base pairs (millions) by year]

4

Outline

Motivation & background

Our contribution

  Frequency vector, frequency distance & wavelet transform

  Multi-resolution index structure

  k-NN & range queries

Experimental results

Conclusion

5

Notation

q: query string.

m, n: lengths of strings.

r: range query radius.

$\epsilon = r/|q|$: error rate.

6

String Similarity: an example

A C T - - T A G C
  R   I I       D
A A T G A T A G -

7

Background

Edit operations: insert, delete, replace.

Edit distance (ED) between s1 and s2 = minimum number of edit operations needed to transform s1 into s2.

Finding the edit distance is costly: dynamic programming takes O(mn) time and space, where m and n are the lengths of s1 and s2 [NW70, SW81].
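The O(mn) dynamic program mentioned above can be sketched as follows. This is a minimal illustration assuming unit cost for each insert, delete, and replace; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert, delete, and replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dist[i][j] holds the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete s1[i-1]
                dist[i][j - 1] + 1,                 # insert s2[j-1]
                dist[i - 1][j - 1] + replace_cost,  # match or replace
            )
    return dist[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))
```

The table has (m+1)(n+1) cells and each cell takes constant work, which is where the O(mn) time and space bound comes from.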

8

Related Work

Lossless search, online:

[Mye86] (Myers): reduces the space requirement to O(rn), where r is the query radius.

[WM92] (Wu, Manber): binary masks, O(rn).

[BYN99] (Baeza-Yates, Navarro): NFA.

Lossless search, offline (index based):

[Mye94] (Myers): condensed r-neighborhood.

[BYN97] (Baeza-Yates, Navarro): dictionary.

Lossy search:

[AG90] (Altschul, Gish): BLAST.

FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MumMER.

[GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.

10

Frequency Vector

Let s be a string from the alphabet $\Sigma = \{\alpha_1, \ldots, \alpha_\sigma\}$. Let $n_i$ be the number of occurrences of the character $\alpha_i$ in s for $1 \le i \le \sigma$. Then the frequency vector is $f(s) = [n_1, \ldots, n_\sigma]$.

Example: s = AATGATAG, f(s) = [$n_A$, $n_C$, $n_G$, $n_T$] = [4, 0, 2, 2]
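The definition can be illustrated with a short sketch. The function name `frequency_vector` and the default alphabet order "ACGT" are assumptions made here to match the example on this slide.

```python
def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Count how often each alphabet character occurs in s, in alphabet order."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    for ch in s:
        counts[index[ch]] += 1  # raises KeyError for characters outside the alphabet
    return counts


print(frequency_vector("AATGATAG"))
```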

11

Effect of Edit Operations on Frequency Vector

Delete: decreases an entry by 1.

Insert: increases an entry by 1.

Replace: insert + delete.

Example:

s = AATGATAG => f(s) = [4, 0, 2, 2]

(del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]

(ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]

(A→C) s = ACCTATAG => f(s) = [3, 2, 1, 2]

12

An Approximation to ED:Frequency Distance (FD1)

s = AATGATAG => f(s) = [4, 0, 2, 2]

q = ACTTAGC => f(q) = [2, 2, 1, 2]

pos = (4-2) + (2-1) = 3

neg = (2-0) = 2

FD1(f(s),f(q)) = 3

ED(q,s) = 4

FD1(f(s1),f(s2)) = max{pos, neg}.

FD1(f(s1),f(s2)) ≤ ED(s1,s2).

[Figure: FD1(f(q),f(s)) shown as the distance between the points f(q) and f(s)]
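The computation on this slide can be reproduced with a small sketch. It reads pos as the sum of the entries where the first vector exceeds the second and neg as the sum where it falls short, which matches the worked numbers; the function name `fd1` is an assumption made for illustration.

```python
def fd1(u: list[int], v: list[int]) -> int:
    """Frequency distance between two frequency vectors of equal dimension."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # total surplus of u over v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # total deficit of u against v
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # AATGATAG
f_q = [2, 2, 1, 2]  # ACTTAGC
print(fd1(f_s, f_q))
```

Each replace operation can cancel one unit of surplus and one unit of deficit at once, while an insert or delete changes only one entry, which is why the larger of pos and neg never exceeds the edit distance.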

13

An Illustration of Frequency Distance & Edit Distance

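[Figure: Set of strings 1 and Set of strings 2 map to the frequency vectors v1 and v2; the edit distance between the strings is shown next to the frequency distance between v1 and v2.]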

14

Using Local Information: Wavelet Decomposition of Strings

s = AATGATAC => f(s)=[4, 1, 1, 2]

s = AATG + ATAC = s1 + s2

f(s1) = [2, 0, 1, 1]

f(s2) = [2, 1, 0, 1]

ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]

ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
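The two vectors on this slide (the sum and the difference of the half-string frequency vectors) can be computed with the sketch below. It assumes an even-length string and the alphabet order "ACGT"; the names `frequency_vector` and `first_two_coefficients` are illustrative, not taken from the slides.

```python
def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Count occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


def first_two_coefficients(s: str, alphabet: str = "ACGT"):
    """Return (sum, difference) of the frequency vectors of the two halves of s."""
    if len(s) % 2 != 0:
        raise ValueError("this sketch assumes an even-length string")
    half = len(s) // 2
    f1 = frequency_vector(s[:half], alphabet)
    f2 = frequency_vector(s[half:], alphabet)
    total = [a + b for a, b in zip(f1, f2)]  # equals f(s): global information
    delta = [a - b for a, b in zip(f1, f2)]  # local information: left vs right half
    return total, delta


print(first_two_coefficients("AATGATAC"))
```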

15

Wavelet Decomposition of a String: General Idea

$A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))$

$B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$

ψ(s) = [first wavelet coefficient, second wavelet coefficient]
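The recurrences above can be evaluated bottom-up, since the frequency vector of a block is the sum of the frequency vectors of its two halves. The sketch below assumes 0-based positions, a string whose length is a power of two, and the alphabet order "ACGT"; the function name `wavelet_levels` is hypothetical.

```python
def wavelet_levels(s: str, alphabet: str = "ACGT"):
    """Return (A, B) where A[i][j] is the frequency vector of the j-th block of
    length 2**i and B[i][j] = A[i-1][2j] - A[i-1][2j+1] (B[0] is empty)."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes the length is a power of two")
    index = {ch: k for k, ch in enumerate(alphabet)}
    # Level 0: each block is a single character, so its vector is one-hot.
    level = []
    for ch in s:
        vec = [0] * len(alphabet)
        vec[index[ch]] = 1
        level.append(vec)
    A, B = [level], [[]]
    while len(level) > 1:
        sums, diffs = [], []
        for j in range(len(level) // 2):
            left, right = level[2 * j], level[2 * j + 1]
            sums.append([a + b for a, b in zip(left, right)])
            diffs.append([a - b for a, b in zip(left, right)])
        A.append(sums)
        B.append(diffs)
        level = sums
    return A, B


A, B = wavelet_levels("AATGATAC")
print(A[3][0], B[3][0])
```

For the string AATGATAC from the previous slide, the top level reproduces the sum and difference vectors computed there.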

16

Wavelet Decomposition & ED

Define FD(s1,s2)=max{FD1, FD2}.

18

MRS-Index Structure Creation

[Figure, built up over slides 18-23: a window of size $w = 2^a$ is placed on string s1 and transformed; the window slides c times (c = box capacity), and the result is labeled $T_{a,1}$.]
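One way to read the construction on these slides is sketched below. It rests on several assumptions that the slides do not state: the window advances one character at a time, only the frequency vector of each window is used as the transform (not further wavelet coefficients), and every c consecutive window positions are summarized by one box given by per-dimension minima and maxima. The function name `build_index_row` is hypothetical.

```python
def build_index_row(s: str, w: int, c: int, alphabet: str = "ACGT"):
    """Slide a window of length w over s and group every c consecutive window
    positions into one box: (first window start, lower corner, upper corner)."""
    if len(s) < w:
        return []
    index = {ch: k for k, ch in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    for ch in s[:w]:
        counts[index[ch]] += 1
    boxes = []
    lower = upper = None
    filled, start = 0, 0
    for pos in range(len(s) - w + 1):
        if pos > 0:
            # Incremental update: drop the character that left the window,
            # add the one that entered it.
            counts[index[s[pos - 1]]] -= 1
            counts[index[s[pos + w - 1]]] += 1
        if filled == 0:
            lower, upper, start = counts.copy(), counts.copy(), pos
        else:
            lower = [min(a, b) for a, b in zip(lower, counts)]
            upper = [max(a, b) for a, b in zip(upper, counts)]
        filled += 1
        if filled == c:  # box is full: emit it and start a new one
            boxes.append((start, lower, upper))
            filled = 0
    if filled:  # last, partially filled box
        boxes.append((start, lower, upper))
    return boxes


for box in build_index_row("AATGATAG", w=4, c=2):
    print(box)
```

Under these assumptions a larger c stores fewer boxes per string, at the cost of each box covering more window positions.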

24

Using Different Resolutions

[Figure: string s1 indexed at two resolutions, $T_{a,1}$ with $w = 2^a$ and $T_{a+1,1}$ with $w = 2^{a+1}$.]

25

MRS-Index Structure

26

MRS-index properties

Relative MBR volume (precision) decreases when:

c increases.

w decreases.

MBRs are highly clustered.

[Figure: box volume by box capacity]

28

Range Queries [KS01]

[Figure: a query of length 208 split into pieces of length 16, 64 and 128 (208 = 16 + 64 + 128), shown against index rows for $w = 2^4$, $2^5$, $2^6$ and $2^7$ over the database strings s1, s2, ..., sd.]
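The figure on this slide shows a query of length 208 next to the numbers 16, 64 and 128, which sum to 208 and correspond to the window sizes 2^4, 2^6 and 2^7. One way to obtain such a split, assuming the query is cut into pieces whose lengths are window sizes available in the index, is a greedy decomposition from the largest window down. The function name `partition_query_length` and its parameters are hypothetical.

```python
def partition_query_length(length: int, min_exp: int, max_exp: int):
    """Split a query length into window sizes 2**min_exp .. 2**max_exp.

    Returns (pieces in ascending order, leftover length shorter than the
    smallest window)."""
    pieces = []
    remaining = length
    for exp in range(max_exp, min_exp - 1, -1):
        size = 1 << exp
        while remaining >= size:  # the largest size may be used more than once
            pieces.append(size)
            remaining -= size
    return sorted(pieces), remaining


print(partition_query_length(208, 4, 7))
```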

29

k-Nearest Neighbor Query [KSF+96, SK98]

k = 3

r = edit distance to the 3rd closest substring

[Figure, built up over slides 29-32: a k-NN query with k = 3 and the search radius r.]
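One plausible reading of these slides is a two-phase search: first pick k candidates and take r as the edit distance to the k-th closest of them, then run a range query with radius r. The sketch below illustrates that idea on a plain list of candidate strings, using the frequency distance as a lower bound on the edit distance. It scans the list instead of using an index, and all function names are assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fd1(a: str, b: str) -> int:
    """Frequency distance between the character counts of a and b."""
    ca, cb = Counter(a), Counter(b)
    pos = sum((ca - cb).values())  # Counter subtraction keeps positive counts only
    neg = sum((cb - ca).values())
    return max(pos, neg)


def knn(query: str, candidates: list[str], k: int) -> list[str]:
    # Phase 1: take the k candidates with the smallest lower bound and let r be
    # the largest true edit distance among them.
    seed = sorted(candidates, key=lambda s: fd1(query, s))[:k]
    r = max(edit_distance(query, s) for s in seed)
    # Phase 2: range query with radius r. Anything with lower bound above r
    # cannot be among the k nearest, so it is skipped without computing ED.
    survivors = [s for s in candidates if fd1(query, s) <= r]
    survivors.sort(key=lambda s: (edit_distance(query, s), s))
    return survivors[:k]


db = ["AATGATAG", "ACTTAGC", "ACTTAGG", "GGGGGGGG", "ACTAGC", "TTTTTTT"]
print(knn("ACTTAGC", db, k=3))
```

The result is exact because the k seed strings already lie within distance r, so every true nearest neighbor has edit distance at most r and therefore a lower bound of at most r.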

33

Outline

Motivation & background

Our contribution

Experimental results

Conclusion

34

Experimental Settings

w = {128, 256, 512, 1024}.

Human chromosomes from www.ncbi.nlm.nih.gov: chr02, chr18, chr21, chr22.

Plotted results are from the chr18 dataset.

Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.

An NFA-based technique [BYN99] is implemented for comparison.

35

Experimental Results 1: Effect of Box Capacity (10-NN)

36

Experimental Results 2: Effect of Window Size (10-NN)

37

Experimental Results 3: k-NN queries

38

Experimental Results 4: Range Queries

39

Outline

Motivation & background

Our contribution

Experimental results

Discussion & conclusion

40

Discussion

In-memory (index size is 1-2% of the database size).

Lossless search.

3 to 45 times faster than the NFA technique for k-NN queries.

2 to 12 times faster than the NFA technique for range queries.

Can be used to speed up any previously defined technique.

41

Future Work

Extend to weighted edit distance and affine gaps.

Extend to local similarity (substring/substring) search.

Compare the quality of answers and speed to BLAST (lossy search).

Use as a preprocessing step to BLAST.

Apply the MRS index structure to larger alphabet sizes (e.g. protein sequences).

43

Related Work (Similar problems)

[BYP92] (Baeza-Yates, Perleberg): only replace is allowed.

[Gus97] (Gusfield): exact matching, suffix trees.

[JKS00] (Jagadish, Koudas, Srivastava): exact matching with wild-cards for multidimensional strings, elided trees and R-tree.

44

THANK YOU

45

Frequency Distance to an MBR

[Figure: FD(f(q),f(s)) between the points f(q) and f(s), and FD(f(q),B) between the point f(q) and the box B.]
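The slide does not give a formula for the distance from a frequency vector to a box. As an illustration only, the sketch below computes a simple lower bound on the frequency distance from f(q) to any point inside a box B described by a lower and an upper corner. This bound is an assumption made here and may be looser than the definition the authors use; the function name `fd_to_box` is hypothetical.

```python
def fd_to_box(q: list[int], lower: list[int], upper: list[int]) -> int:
    """Lower bound on FD1 between q and every vector v with lower <= v <= upper."""
    # Any v in the box has v_i >= lower_i, so it exceeds q by at least this much.
    pos = sum(max(0, lo - x) for x, lo in zip(q, lower))
    # Any v in the box has v_i <= upper_i, so it falls short of q by at least this much.
    neg = sum(max(0, x - hi) for x, hi in zip(q, upper))
    return max(pos, neg)


f_q = [2, 2, 1, 2]
print(fd_to_box(f_q, lower=[3, 0, 1, 1], upper=[4, 1, 2, 2]))
```

When the box collapses to a single point, the bound coincides with the frequency distance between the two vectors, and it is zero when f(q) lies inside the box.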
