1
An Efficient Index Structure for String Databases
Tamer Kahveci, Ambuj K. Singh
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/~tamer
2
Whole/Substring Matching Problem
Quickly find the substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size).
[Figure: a query string shown against a database string]
3
String Similarity
Motivation: Applications
Genetic sequence databases, NCBI
Text databases, spell checkers, web search.
Video databases (e.g. VIRAGE, MEDIA360)
Database size is too large. Most of the techniques available are in-memory.
Space requirement of current indexes is too large.
[Figure: plot of Base Pairs (millions) against Year]
4
Outline
Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN & range queries
Experimental results
Conclusion
5
Notation
q : query string.
m, n : lengths of strings.
r : range query radius.
ε = r/|q| : error rate.
6
String Similarity: an example
A C T - - T A G C
  R   I I       D
A A T G A T A G -
(R = replace, I = insert, D = delete)
7
Background
Edit operations: insert, delete, replace.
Edit distance (ED) between s1 and s2 = minimum number of edit operations needed to transform s1 into s2.
Finding the edit distance is costly: dynamic programming takes O(mn) time and space, where m and n are the lengths of s1 and s2 [NW70, SW81].
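The O(mn) dynamic program mentioned above can be sketched as follows. This is a minimal illustration with unit costs for insert, delete and replace; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance via the classic O(mn) dynamic program."""
    m, n = len(s1), len(s2)
    # d[i][j] = edit distance between s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace or match
    return d[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```

The full table is kept here for clarity; only two rows are needed if the alignment itself is not required.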
8
Related Work
Lossless search
  Online
    [Mye86] (Myers): reduces the space requirement to O(rn), where r is the query radius.
    [WM92] (Wu, Manber): binary masks, O(rn).
    [BYN99] (Baeza-Yates, Navarro): NFA.
  Offline (index based)
    [Mye94] (Myers): condensed r-neighborhood.
    [BYN97] (Baeza-Yates, Navarro): dictionary.
Lossy search
  [AG90] (Altschul, Gish): BLAST.
  FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MUMmer.
  [GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.
9
Outline
Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN & range queries
Experimental results
Conclusion
10
Frequency Vector
Let s be a string over the alphabet Σ = {α1, ..., ασ}. Let ni be the number of occurrences of the character αi in s for 1 ≤ i ≤ σ. Then the
frequency vector is f(s) = [n1, ..., nσ].
Example: s = AATGATAG, f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
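A small sketch of the frequency vector for DNA strings follows. The alphabet order A, C, G, T matches the example above; the function name and the `alphabet` parameter are illustrative choices, not part of the slides.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s."""
    counts = Counter(s)
    # Counter returns 0 for characters that never occur.
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
```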
11
Effect of Edit Operations on Frequency Vector
Delete : decreases an entry by 1.
Insert : increases an entry by 1.
Replace : Insert + Delete.
Example:
  s = AATGATAG => f(s) = [4, 0, 2, 2]
  (del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]
  (ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]
  (A→C) s = ACCTATAG => f(s) = [3, 2, 1, 2]
12
An Approximation to ED: Frequency Distance (FD1)
s = AATGATAG => f(s) = [4, 0, 2, 2]
q = ACTTAGC => f(q) = [2, 2, 1, 2]
pos = (4-2) + (2-1) = 3
neg = (2-0) = 2
FD1(f(s),f(q)) = 3
ED(q,s) = 4
FD1(f(s1),f(s2)) = max{pos, neg}.
FD1(f(s1),f(s2)) ≤ ED(s1,s2).
[Figure: FD1(f(q),f(s)) drawn between the points f(q) and f(s)]
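The computation on this slide can be written as a short function. This is a sketch that takes two frequency vectors of equal length; the name `fd1` mirrors the slide's FD1 but the signature is an assumption made here.

```python
def fd1(u, v):
    """Frequency distance between two frequency vectors.

    pos sums the entries where u exceeds v, neg sums the entries where
    v exceeds u; the distance is the larger of the two.
    """
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # f(AATGATAG)
f_q = [2, 2, 1, 2]  # f(ACTTAGC)
print(fd1(f_s, f_q))  # 3
```

Because one replace can lower one entry and raise another at the same time, taking the maximum of pos and neg (rather than their sum) is what keeps the value below the edit distance, here 3 against ED = 4.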
13
An Illustration of Frequency Distance & Edit Distance
[Figure: Set of strings 1 and Set of strings 2 with frequency vectors v1 and v2; labels: Frequency Distance, Edit Distance]
14
Using Local Information: Wavelet Decomposition of Strings
s = AATGATAC => f(s) = [4, 1, 1, 2]
s = AATG + ATAC = s1 + s2
f(s1) = [2, 0, 1, 1]
f(s2) = [2, 1, 0, 1]
ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
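The two coefficients in this example can be reproduced with a few lines. This is a sketch that splits the string into two equal halves, as on the slide; the function names are hypothetical.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def first_two_coefficients(s):
    """Sum and difference of the frequency vectors of the two halves of s."""
    half = len(s) // 2
    f1 = frequency_vector(s[:half])
    f2 = frequency_vector(s[half:])
    first = [a + b for a, b in zip(f1, f2)]   # equals f(s)
    second = [a - b for a, b in zip(f1, f2)]  # where the halves differ
    return first, second


print(first_two_coefficients("AATGATAC"))
```

The second coefficient keeps some positional information that the plain frequency vector loses: two strings with the same character counts can still differ in how those characters are spread over the two halves.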
15
Wavelet Decomposition of a String: General Idea
A_{i,j} = f(s(j·2^i : (j+1)·2^i - 1))
B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}
ψ(s) = [first wavelet coefficient, second wavelet coefficient]
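The recurrences above can be evaluated bottom-up. The sketch below assumes the string length is a power of two and that level 0 holds the frequency vectors of single characters; the function name and the list-of-levels layout are illustrative choices, not taken from the slides.

```python
def wavelet_levels(s, alphabet="ACGT"):
    """Return (A, B) where A[i][j] and B[i][j] follow the recurrences above.

    A[i][j] is the frequency vector of the j-th block of length 2**i.
    B[i][j] is the difference of the two half-blocks that make up that block.
    B[0] is empty because single characters have no halves.
    """
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("length of s must be a power of two")
    index = {ch: k for k, ch in enumerate(alphabet)}
    level0 = []
    for ch in s:
        v = [0] * len(alphabet)
        v[index[ch]] = 1
        level0.append(v)
    A, B = [level0], [[]]
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        A.append([[x + y for x, y in zip(l, r)] for l, r in pairs])
        B.append([[x - y for x, y in zip(l, r)] for l, r in pairs])
    return A, B


A, B = wavelet_levels("AATGATAC")
print(A[-1][0], B[-1][0])  # top-level sum and difference
```

For s = AATGATAC the top level reproduces the worked example: the sum is f(s) and the difference is f(AATG) - f(ATAC).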
16
Wavelet Decomposition & ED
Define FD(s1,s2)=max{FD1, FD2}.
17
Outline
Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN and range queries
Experimental results
Conclusion
18
MRS-Index Structure Creation
[Figure: a window of length w = 2^a on s1 is passed through the transform]
19
MRS-Index Structure Creation
s1
20
MRS-Index Structure Creation
s1
21
MRS-Index Structure Creation
[Figure: the window slides c times along s1, where c = box capacity]
22
MRS-Index Structure Creation
s1
...
23
MRS-Index Structure Creation
[Figure: box T_{a,1} for s1 at window size w = 2^a]
24
Using Different Resolutions
[Figure: boxes T_{a,1} at window size w = 2^a and T_{a+1,1} at window size w = 2^(a+1) for s1]
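The construction shown in these slides can be approximated in a few lines. The sketch below makes several assumptions that are not confirmed by the slides: each box is taken to be the minimum bounding rectangle of c consecutive window positions, the transform is reduced to the plain frequency vector (the second wavelet coefficient is left out), and the names `build_row` and `build_index` are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def frequency_vector(s):
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def build_row(s, w, c):
    """Boxes for one resolution: window length w, box capacity c."""
    boxes = []
    n_windows = len(s) - w + 1
    for start in range(0, n_windows, c):
        stop = min(start + c, n_windows)
        vectors = [frequency_vector(s[p:p + w]) for p in range(start, stop)]
        low = [min(col) for col in zip(*vectors)]   # lower corner of the box
        high = [max(col) for col in zip(*vectors)]  # upper corner of the box
        boxes.append((low, high))
    return boxes


def build_index(s, exponents, c):
    """One row of boxes per resolution w = 2**a."""
    return {a: build_row(s, 2 ** a, c) for a in exponents}


index = build_index("AATGATAC", exponents=[1, 2], c=2)
for a, row in index.items():
    print(2 ** a, row)
```

Sliding the window by one character changes each frequency count by at most one, which is why consecutive windows give small, tightly clustered boxes.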
25
MRS-Index Structure
26
MRS-index properties
Relative MBR volume (precision) decreases when
  c increases,
  w decreases.
MBRs are highly clustered.
[Figure: Box volume plotted against Box Capacity]
27
Outline
Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN & range queries
Experimental results
Conclusion
28
Range Queries [KS01]
[Figure: a query of length 208 shown as pieces of length 16, 64 and 128 against the index rows w = 2^4, 2^5, 2^6, 2^7 for database strings s1, s2, ..., sd]
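The figure's multi-resolution procedure is not reproduced here. As a much simpler illustration of why a lower bound such as FD1 ≤ ED allows lossless pruning, the sketch below runs a range query over whole strings: a candidate is discarded without computing its edit distance whenever its frequency distance already exceeds r. All function names are hypothetical and whole-string matching is an assumption made for brevity.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def range_query(query, strings, r):
    """Return (string, distance) pairs with edit distance at most r."""
    fq = frequency_vector(query)
    results = []
    for s in strings:
        if fd1(fq, frequency_vector(s)) > r:
            continue  # safe to skip: FD1 is a lower bound on ED
        d = edit_distance(query, s)
        if d <= r:
            results.append((s, d))
    return results


print(range_query("ACTTAGC", ["AATGATAG", "ACTTAGG", "GGGGGGGG"], r=2))
```

No true answer is lost by the filter, since a string with ED ≤ r must also have FD1 ≤ r.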
29
k-Nearest Neighbor Query [KSF+96, SK98]
k = 3
30
k-Nearest Neighbor Query
k = 3
r = Edit distance to 3rd closest substring
31
k-Nearest Neighbor Query
k = 3
r
32
k-Nearest Neighbor Query
k = 3
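The slides above suggest a two-phase search: first obtain k candidates and set r to the edit distance of the k-th closest one, then run a range query with radius r. A minimal sketch of that idea over whole strings follows; it uses FD1 as the lower bound, assumes k is at most the number of strings, and all names are hypothetical rather than taken from the cited work.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def knn(query, strings, k):
    """Return the k (distance, string) pairs closest to query in edit distance."""
    fq = frequency_vector(query)
    bounds = sorted((fd1(fq, frequency_vector(s)), s) for s in strings)
    # Phase 1: exact distances of the k candidates with the smallest lower bound.
    r = max(edit_distance(query, s) for _, s in bounds[:k])
    # Phase 2: range query with radius r; strings with a lower bound above r
    # cannot be among the k nearest.
    answers = sorted((edit_distance(query, s), s) for lb, s in bounds if lb <= r)
    return answers[:k]


db = ["AATGATAG", "ACTTAGC", "ACTTAGG", "GGGGGGGG", "ACTAGC"]
print(knn("ACTTAGC", db, k=3))
```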
33
Outline
Motivation & background
Our contribution
Experimental results
Conclusion
34
Experimental Settings
w = {128, 256, 512, 1024}.
Human chromosomes from www.ncbi.nlm.nih.gov: chr02, chr18, chr21, chr22.
Plotted results are from the chr18 dataset.
Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.
An NFA-based technique [BYN99] is implemented for comparison.
35
Experimental Results 1: Effect of Box Capacity (10-NN)
36
Experimental Results 2: Effect of Window Size (10-NN)
37
Experimental Results 3: k-NN queries
38
Experimental Results 4: Range Queries
39
Outline
Motivation & background
Our contribution
Experimental results
Discussion & conclusion
40
Discussion
In-memory (index size is 1-2% of the database size).
Lossless search.
3 to 45 times faster than the NFA technique for k-NN queries.
2 to 12 times faster than the NFA technique for range queries.
Can be used to speed up any previously defined technique.
41
Future Work
Extend to weighted edit distance and affine gaps.
Extend to local similarity (substring/substring) search.
Compare the quality of answers and speed to BLAST (lossy search).
Use as a preprocessing step to BLAST.
Apply the MRS index structure to larger alphabet sizes (e.g. protein sequences).
43
Related Work (Similar problems)
[BYP92] (Baeza-Yates, Perleberg): only replace is allowed.
[Gus97] (Gusfield): exact matching, suffix trees.
[JKS00] (Jagadish, Koudas, Srivastava): exact matching with wild-cards for multidimensional strings, elided trees and R-tree.
44
THANK YOU
45
Frequency Distance to an MBR
[Figure: the distance FD(f(q),f(s)) between the points f(q) and f(s), and the distance FD(f(q),B) between the point f(q) and a box B]
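One simple way to bound the distance from a query's frequency vector to a box is sketched below. It is an assumption made for illustration, not necessarily the authors' definition: per dimension, only the amount by which f(q) lies outside the box is counted, which never exceeds FD1 to any point inside the box. The box is given by its lower and upper corners, and the name `fd_to_box` is hypothetical.

```python
def fd_to_box(fq, low, high):
    """Lower bound on the FD1 distance from fq to any vector inside the box."""
    # Amount by which fq lies above the box in each dimension.
    pos = sum(q - h for q, h in zip(fq, high) if q > h)
    # Amount by which fq lies below the box in each dimension.
    neg = sum(l - q for q, l in zip(fq, low) if q < l)
    return max(pos, neg)


f_q = [2, 2, 1, 2]
print(fd_to_box(f_q, low=[1, 0, 1, 1], high=[2, 0, 1, 2]))
```

When the box shrinks to a single point (low equals high), the value coincides with FD1 to that point, and it is 0 whenever f(q) lies inside the box.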