Efficient Top-k Algorithms for Approximate Substring Matching
Presented by Jagadeesh Potluri and Shiva Krishna Imminni
Our Presentation covers
• Introduction
• Problem Statement
• Approach
• Performance Results
Introduction
• With the availability of vast amounts of data, retrieving similar strings has become a more challenging problem.
• Applications:
  - web search, music data retrieval, finding DNA subsequences and many more
• Given a large database of texts and a query string, we need an efficient way to search for similar strings or substrings.
• Edit distance is one of the most widely accepted distance measures for database applications.
Problem Statement
• Traditional approximate substring matching requires the user to specify a similarity threshold.
• This is not efficient because
  - There is no global threshold value that works for all types of data.
  - The threshold value requires fine tuning.
• Instead, we opt for algorithms that return the top-k results
  - k: the total number of results a user would like to see.
Example
• Given a query string, the top-k algorithm fetches the top-k approximate substring matches from the text database.
• Query string is ‘Jackson’ and k = 3
  - s1: substring edit distance is 0.
  - s2: substring edit distance is 3.
  - s3: substring edit distance is 2.
  - s4, s5, s6: substring edit distances are 1.
• Output
  - Top 3 = {‘Jackson Pollock’, ‘Jacksomville’, ‘Jakson Pollack’}
TopK Algorithms
• TopK-Naïve
• TopK-LB
• TopK-Split
TopK-Naïve
• Given a set of strings D and a query string σ
• dsub(s, σ): the minimum among the edit distances between σ and every substring of s
• Algorithm:
1. HTopK = an empty max-heap storing <substring edit distance, string> pairs;
2. For every string s in D,
   • compute the substring edit distance dsub(s, σ).
   • If the size of HTopK is less than k
     • insert the string s into HTopK
   • Else, if dsub(s, σ) < dsub(sR, σ), where sR is the string at the root of HTopK
     • delete sR from HTopK
     • insert the string s into HTopK
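The steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function names `sub_edit_distance` and `topk_naive` are made up here, the substring edit distance uses the standard dynamic program in which a match may start and end anywhere in s, and the max-heap is emulated with `heapq` by negating distances.

```python
import heapq

def sub_edit_distance(s, query):
    """Minimum edit distance between query and any substring of s."""
    n = len(s)
    # Row for the empty query prefix: it matches an empty substring anywhere.
    prev = [0] * (n + 1)
    for i, qc in enumerate(query, 1):
        cur = [i] + [0] * n  # query[:i] against an empty substring costs i
        for j in range(1, n + 1):
            cost = 0 if s[j - 1] == qc else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitute
                         prev[j] + 1,         # drop a query character
                         cur[j - 1] + 1)      # consume an extra character of s
        prev = cur
    return min(prev)  # the matching substring may end anywhere in s

def topk_naive(strings, query, k):
    """Scan every string and keep the k smallest substring edit distances."""
    heap = []  # max-heap via negated distance: (-distance, index, string)
    for idx, s in enumerate(strings):
        d = sub_edit_distance(s, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx, s))
        elif d < -heap[0][0]:  # better than the current worst (root sR)
            heapq.heapreplace(heap, (-d, idx, s))
    return sorted((-nd, s) for nd, _, s in heap)

# 'Michael Jordan' is a made-up extra string; the others come from the example slide.
data = ['Michael Jordan', 'Jacksomville', 'Jackson Pollock', 'Jakson Pollack']
print(topk_naive(data, 'Jackson', 3))
```

Every string costs one full dynamic-programming pass of O(|σ| · |s|) time, which is the expense the later algorithms try to avoid.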
TopK-Naïve
• Is this efficient?
  - It examines every string s in D and computes the substring edit distance dsub(s, σ) one by one.
  - Computation of the substring edit distance dsub(s, σ) is very expensive.
Can we do better?
• By utilizing q-grams in the query string (TopK-LB) and
• Inverted q-gram indexes (TopK-Split).
q-grams and Inverted q-gram Indexes
• Positional q-grams:
  - For the string s = ‘Jackson’ and q = 3, (‘Jac’,1), (‘ack’,2), (‘cks’,3), (‘kso’,4) and (‘son’,5) are the positional 3-grams of the string s.
• Inverted q-gram index of D
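A small Python sketch of both ideas may help. The function names and the index layout (q-gram mapped to a list of (string id, position) pairs) are assumptions made for illustration; the slide's own index figure is not reproduced in the text. Positions are 1-based, as in the ‘Jackson’ example.

```python
from collections import defaultdict

def positional_qgrams(s, q):
    """Return the (q-gram, 1-based position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]

def build_inverted_index(strings, q):
    """Map each q-gram to the (string id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, pos in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index

print(positional_qgrams('Jackson', 3))
# 'Jack' is a made-up second string, used only to show a shared posting list.
index = build_inverted_index(['Jackson', 'Jack'], 3)
print(index['Jac'])
```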
TopK-LB
• What is LB (Lower Bound)?
  - s: string
  - σ: query
  - c: number of common q-grams
  - q: length of a q-gram
Cont..
s = S A M P L E S T R, |s| = 9, q = 3
Number of 3-grams in s: |s| - 3 + 1 = 7
• Assume there are no matching q-grams. Then, according to the previous formula, d(s, σ) ≥ ceil(7/3) = 3.
• Assume there are 2 matching q-grams. Then, according to the previous formula, d(s, σ) ≥ ceil((7 - 2)/3) = 2.
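The two worked values can be reproduced with a short Python sketch. The formula slide itself is not preserved in this text, so the expression below, ceil((number of q-grams - c) / q), is inferred from the two examples rather than quoted, and the function name is hypothetical.

```python
from math import ceil

def count_lower_bound(num_qgrams, common, q):
    """Lower bound on the edit distance from the count of common q-grams.

    Assumption: one edit operation can destroy at most q q-grams, so the
    (num_qgrams - common) unmatched q-grams need at least this many edits.
    """
    return max(0, ceil((num_qgrams - common) / q))

num_qgrams = 9 - 3 + 1  # |s| - q + 1 for s = 'SAMPLESTR', q = 3
print(count_lower_bound(num_qgrams, 0, 3))  # no matching q-grams
print(count_lower_bound(num_qgrams, 2, 3))  # two matching q-grams
```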
Lower Bound for Substring
• Let ci be the number of common q-grams between σ and s[i, i+|σ|-1]
• Time complexity: O(|σ| · |s|)
• Goal: a tighter lower bound and an O(l^2) algorithm, where l is the number of matching q-gram pairs, which is << min(|σ|, |s|)
Calculating lo(dsub(s, σ))
• Given σ = ‘Jacksonville’, |σ| = 12 and s = ‘Jack Willson’.
• Matching positional 3-grams:
  - Xσ = {(Jac,1), (ack,2), (son,5), (ill,9)}
  - Ys = {(Jac,1), (ack,2), (ill,7), (son,10)}
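The two sets above can be recomputed with a few lines of Python. The helper names are hypothetical; positions are 1-based to match the slide.

```python
def positional_qgrams(s, q):
    """Return the (q-gram, 1-based position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]

def matching_positional_qgrams(query, s, q):
    """Positional q-grams of query and of s whose q-gram occurs in both."""
    query_grams = positional_qgrams(query, q)
    s_grams = positional_qgrams(s, q)
    in_s = {g for g, _ in s_grams}
    in_query = {g for g, _ in query_grams}
    x = [(g, p) for g, p in query_grams if g in in_s]
    y = [(g, p) for g, p in s_grams if g in in_query]
    return x, y

x, y = matching_positional_qgrams('Jacksonville', 'Jack Willson', 3)
print(x)  # positional 3-grams of the query that also occur in s
print(y)  # positional 3-grams of s that also occur in the query
```

Only these l matching pairs are fed to the lower-bound computation, which is why l, rather than |σ| · |s|, drives its cost.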
Finding LB by DYN-LB
• m[i, j]: defined for a positional q-gram pair <(xi, pi), (yj, rj)> such that (xi, pi) ∈ Xσ, (yj, rj) ∈ Ys and xi = yj
  - i indexes a query q-gram with its position
  - j indexes an input string q-gram with its position
• m[i, j] = [equation not preserved in the extracted text]
• ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2, ‘son’ -> m[3, 4] = 2
Choosing LB
• ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2 and ‘son’ -> m[3, 4] = 2
• lo(dsub(s, σ))
  - m[1, 1] + 3 = 0 + 3 = 3, m[2, 2] + 3 = 0 + 3 = 3, m[3, 4] + 2 = 2 + 2 = 4, m[4, 3] + 1 = 2 + 1 = 3
  - lo(dsub(s, σ)) = min{4, min{3, 3, 4, 3}} = 3, where 4 = ceil((12 - 3 + 1)/3)
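The final selection step can be checked numerically. In the sketch below the m values are taken from the slide as given; the terms added to them (3, 3, 2, 1) happen to equal ceil((|σ| - p - q + 1)/q) for query positions p = 1, 2, 5, 9, so the code uses that expression. This is an inference from the numbers, not a formula stated in the slides, and the function name is hypothetical.

```python
from math import ceil

def choose_lower_bound(query_len, q, entries):
    """Pick the final lower bound from (query position, m value) pairs.

    Assumption: the remainder of the query after a matched q-gram at
    position p contributes ceil((query_len - p - q + 1) / q) more edits.
    """
    # Bound when no q-gram matches at all: ceil((|query| - q + 1) / q).
    best = ceil((query_len - q + 1) / q)
    for p, m in entries:
        tail = ceil((query_len - p - q + 1) / q)
        best = min(best, m + tail)
    return best

# (position in the query, m value) for 'Jac', 'ack', 'son', 'ill'
entries = [(1, 0), (2, 0), (5, 2), (9, 2)]
print(choose_lower_bound(12, 3, entries))
```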
Example for TopK-LB
• σ = ‘Jacksen’, |σ| = 7, k = 2
• s1 and s2: dsub(s1, σ) = 1, dsub(s2, σ) = 4
  - Both are inserted into the max-heap HTopK.
  - At the end of this step the max-heap contains {(s1, 1), (s2, 4)}
• Next, lo(dsub(s3, σ)) = 2 by DYN-LB
  - dsub(sR, σ) = 4 and lo(dsub(s3, σ)) < 4, so compute dsub(s3, σ).
  - At the end of this step the max-heap contains {(s1, 1), (s3, 3)}
• Next, lo(dsub(s4, σ)) = 1
  - dsub(sR, σ) = 3 and lo(dsub(s4, σ)) < 3, so compute dsub(s4, σ).
  - At the end of this step the max-heap contains {(s1, 1), (s4, 2)}
Cont…
• Next, lo(dsub(s5, σ)) = 2, which is not less than dsub(sR, σ) = 2
  - Skip the edit distance computation.
  - At the end of this step the max-heap contains {(s1, 1), (s4, 2)}
• Next, lo(dsub(s6, σ)) = 1
  - dsub(sR, σ) = 2 and lo(dsub(s6, σ)) < 2, so compute dsub(s6, σ).
  - At the end of this step the max-heap contains {(s1, 1), (s6, 1)}
• Summary:
  - We calculated the substring edit distance for 5 of the 6 strings.
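The pruning loop walked through above can be sketched in Python. This is an illustration under stated assumptions, not the presenters' code: all function names are made up, and `loose_lower_bound` is a simple count-based bound (query q-grams that occur anywhere in s), which is valid but looser than DYN-LB. A string is skipped when its lower bound is not less than the distance at the heap root.

```python
import heapq
from math import ceil

def sub_edit_distance(s, query):
    """Minimum edit distance between query and any substring of s."""
    prev = [0] * (len(s) + 1)
    for i, qc in enumerate(query, 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if s[j - 1] == qc else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return min(prev)

def loose_lower_bound(s, query, q):
    """Count-based lower bound on dsub(s, query); a stand-in for DYN-LB."""
    total = len(query) - q + 1
    if total <= 0:
        return 0
    grams_in_s = {s[i:i + q] for i in range(len(s) - q + 1)}
    common = sum(1 for i in range(total) if query[i:i + q] in grams_in_s)
    return ceil((total - common) / q)

def topk_lb(strings, query, k, q=3):
    """Top-k search that skips strings whose lower bound cannot win."""
    if k <= 0:
        return [], 0
    heap, computed = [], 0  # heap entries: (-distance, index, string)
    for idx, s in enumerate(strings):
        if len(heap) == k and loose_lower_bound(s, query, q) >= -heap[0][0]:
            continue  # cannot beat the current worst result (root sR)
        d = sub_edit_distance(s, query)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx, s))
    return sorted((-nd, s) for nd, _, s in heap), computed

# 'Michael Jordan' is a made-up string added for illustration.
data = ['Jackson Pollock', 'Jakson Pollack', 'Michael Jordan', 'Jacksomville']
print(topk_lb(data, 'Jackson', 2))
```

The second return value counts how many full edit distance computations were actually performed, which is the quantity the example's summary line reports.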
TopK-Split improves on TopK-LB
• In TopK-LB, we calculate a lower bound for every string s.
• Can we reduce the number of LB computations?
• Split the data set D into DG+ and DG-
• Calculate the LB only for strings that fall in DG+.
Computing the Best G’
1. Get the inverted index of the positional q-grams in σ
Cont..
2. Return the non-overlapping q-gram set G’ in σ of size τ that has the minimum u[i, τ]
Cont..
• Finally, to select the best G’ with size 2
  - we choose [8, 2] = {‘ack’, ‘onv’} for G’ since u[8, 2] is the minimum among u[6, 2], u[7, 2] and u[8, 2]
Performance Results
• We evaluated the performance of TopK-Naïve and TopK-LB by varying k
• Number of strings = 865361, query length = 7, q-gram length = 3, Java heap space = 4096 MB
Cont..
• We are going to evaluate the performance (execution time) of the three algorithms by varying the following parameters
  - k
  - Query length
  - Length of q-grams
  - Buffer size
  - Input data set (Wikipedia and DBLP)
• For all the parameters, we expect the execution times to be ordered as
  - TopK-Naïve > TopK-LB > TopK-Split