Efficient Top-k Algorithms for Approximate Substring Matching Presented by Jagadeesh Potluri Shiva Krishna Imminni


Our Presentation covers

• Introduction
• Problem Statement
• Approach
• Performance Results

Introduction

• With the availability of vast amounts of data, retrieving similar strings has become a more challenging problem.

• Applications:
  • Web search, music data retrieval, finding DNA subsequences and many more

• Given a large database of texts and a query string, we need to find an efficient way to search for similar strings or sub-strings.

• Edit distance is one of the most widely accepted distance measures for database applications.

Problem Statement

• Traditional approximate substring matching requires the user to specify a similarity threshold.

• This is inconvenient because:
  - There is no global threshold value that works for all types of data.
  - The threshold value has to be fine-tuned.

• We therefore opt for algorithms that return the top-k results.
  - k: the total number of results a user would like to see.

Example

• Given a query string, a top-k algorithm returns the k best approximate substring matches from the text database.
• Query string is ‘Jackson’ and k = 3

• s1: substring edit distance is 0.
• s2: substring edit distance is 3.
• s3: substring edit distance is 2.
• s4, s5, s6: substring edit distances are 1.

• Output: Top 3 = {‘Jackson Pollock’, ‘Jacksomville’, ‘Jakson Pollack’}

TopK Algorithms

• TopK-Naïve
• TopK-LB
• TopK-Split

TopK-Naïve

• Given a set of strings D and a query string σ
• dsub(s, σ): the substring edit distance, i.e. the minimum among the edit distances between σ and every substring of s
• Algorithm:

1. HTopK = an empty max-heap storing <substring edit distance, string> pairs
2. For every string s in D:
   • Compute the substring edit distance dsub(s, σ).
   • If the size of HTopK is less than k:
      • Insert the string s into HTopK.
   • Else, if dsub(s, σ) < dsub(sR, σ), where sR is the string at the root of HTopK:
      • Delete sR from HTopK.
      • Insert the string s into HTopK.
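The steps above can be sketched in Python as follows. This is a minimal sketch rather than the presenters' implementation: the function names `dsub` and `topk_naive` are made up for illustration, the substring edit distance uses the standard dynamic program whose first row is all zeros (so a match may start anywhere in s), and the max-heap is emulated by negating distances because Python's `heapq` is a min-heap.

```python
import heapq


def dsub(s, query):
    """Minimum edit distance between `query` and any substring of `s`."""
    # Row 0 is all zeros: a match may start at any position of s for free.
    prev = [0] * (len(s) + 1)
    for i, qc in enumerate(query, 1):
        # Column 0: the first i query characters matched against nothing.
        cur = [i] + [0] * len(s)
        for j, sc in enumerate(s, 1):
            cur[j] = min(
                prev[j] + 1,               # query character left unmatched
                cur[j - 1] + 1,            # text character left unmatched
                prev[j - 1] + (qc != sc),  # match or substitution
            )
        prev = cur
    # A match may also end at any position of s.
    return min(prev)


def topk_naive(D, query, k):
    """Scan every string in D and keep the k smallest substring edit distances."""
    if k <= 0:
        return []
    heap = []  # entries are (-distance, string); the root holds the worst kept string
    for s in D:
        d = dsub(s, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, s))  # drop the root, insert s
    return sorted((-neg_d, s) for neg_d, s in heap)
```

Calling `topk_naive(strings, 'Jackson', 3)` on a list of strings returns (distance, string) pairs ordered by distance. Every string pays the full cost of the dynamic program, which is O(|σ| · |s|) per string.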

TopK-Naïve

• Is this efficient?
  • It examines every string s in D and computes the substring edit distance dsub(s, σ) one by one.
  • Computation of the substring edit distance dsub(s, σ) is very expensive.

Can we do better?

• By utilizing q-grams in the query strings (TopK-LB), and
• Inverted q-gram indexes (TopK-Split).

q-grams and Inverted q-gram Indexes

• Positional q-grams:
  • For the string s = ‘Jackson’ and q = 3, (‘Jac’,1), (‘ack’,2), (‘cks’,3), (‘kso’,4) and (‘son’,5) are the positional 3-grams of s.

• Inverted q-gram index of D
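As an illustration of the two structures named on this slide, here is a small Python sketch. The function names and the layout of the index (q-gram mapped to a list of (string id, position) pairs) are assumptions chosen for the example; the slides do not specify how the index is stored.

```python
from collections import defaultdict


def positional_qgrams(s, q=3):
    """Return (q-gram, 1-based start position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def build_inverted_index(D, q=3):
    """Map each q-gram to the (string id, position) pairs where it occurs in D."""
    index = defaultdict(list)
    for sid, s in enumerate(D):
        for gram, pos in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index
```

Looking up a q-gram of the query in such an index yields every string that contains it, without scanning D.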

TopK-LB

• What is LB (Lower Bound)?

• s: string
• σ: query
• c: number of common q-grams
• q: length of a q-gram

Cont..

• s = ‘SAMPLESTR’, |s| = 9, q = 3
• Number of 3-grams in s: |s| - 3 + 1 = 7 (one starting at each of the positions S A M P L E S)

• Assume there are no matching q-grams. According to the previous formula, the lower bound on d(s, σ) is ceil(7/3) = 3.

• Assume there are 2 matching q-grams. According to the previous formula, the lower bound on d(s, σ) is ceil((7 - 2)/3) = 2.
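The arithmetic in this example can be checked with a few lines of Python. The formula itself is not reproduced in the text of the slide, so the expression below, ceil((|s| - q + 1 - c) / q), is inferred from the two worked numbers; the function name is hypothetical.

```python
from math import ceil


def qgram_lower_bound(length, q, c):
    """Lower bound on the edit distance from the number of common q-grams.

    length: |s|, q: q-gram length, c: number of common q-grams.
    Assumed form, inferred from the worked example: ceil((|s| - q + 1 - c) / q).
    """
    return max(0, ceil((length - q + 1 - c) / q))


no_match = qgram_lower_bound(9, 3, 0)     # 7 q-grams, none shared
two_matches = qgram_lower_bound(9, 3, 2)  # 7 q-grams, 2 shared
```

The intuition is that a single edit operation can destroy at most q q-grams, so the q-grams that are not shared must be explained by at least that many edits divided by q.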

Lower Bound for Substring

• Let ci be the number of common q-grams between σ and s[i, i+|σ|-1].
• Time complexity: O(|σ| · |s|)
• Towards a tight lower bound and an O(l^2) algorithm, where l is the number of matching q-gram pairs, which is << min(|σ|, |s|)

Calculating lo(dsub(s, σ))

• Given σ = ‘Jacksonville’, |σ| = 12 and s = ‘Jack Willson’.

• Matching positional 3-grams:
  • Xσ = {(Jac,1), (ack,2), (son,5), (ill,9)}
  • Ys = {(Jac,1), (ack,2), (ill,7), (son,10)}
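The two sets above can be reproduced with a short Python sketch. The function names are hypothetical; the sketch simply keeps the positional 3-grams of each string whose q-gram also occurs in the other string.

```python
def positional_qgrams(s, q=3):
    """Return (q-gram, 1-based start position) pairs of s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def matching_positional_qgrams(query, s, q=3):
    """Return (X, Y): positional q-grams of query and of s that occur in both."""
    grams_query = positional_qgrams(query, q)
    grams_s = positional_qgrams(s, q)
    in_query = {g for g, _ in grams_query}
    in_s = {g for g, _ in grams_s}
    X = [(g, p) for g, p in grams_query if g in in_s]
    Y = [(g, r) for g, r in grams_s if g in in_query]
    return X, Y


X, Y = matching_positional_qgrams('Jacksonville', 'Jack Willson')
```

Only these l matching pairs feed the O(l^2) lower-bound computation, which is why it is cheap when few q-grams are shared.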

Finding LB by DYN-LB

• m[i, j] is defined for a positional q-gram pair <(xi, pi), (yj, rj)> such that (xi, pi) ∈ Xσ, (yj, rj) ∈ Ys and xi = yj
  • (xi, pi): query q-gram with its position
  • (yj, rj): input string q-gram with its position

• m[i, j] = [recurrence not preserved in this text]

• ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2, ‘son’ -> m[3, 4] = 2

Choosing LB

• ‘Jac’ -> m[1, 1] = 0, ‘ack’ -> m[2, 2] = 0, ‘ill’ -> m[4, 3] = 2 and ‘son’ -> m[3, 4] = 2

• lo(dsub(s, σ)):
  • Adding the remaining term to each m[i, j]: m[1, 1] + 3 = 3, m[2, 2] + 3 = 3, m[3, 4] + 2 = 4, m[4, 3] + 1 = 3
  • lo(dsub(s, σ)) = min{4, min{3, 3, 4, 3}} = 3
  • The first 4 is ceil((12 - 3 + 1)/3).
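The final minimum can be reproduced numerically. The sketch below takes the m[i, j] values from the slide as given inputs and does not implement the DYN-LB recurrence. The term added to each m[i, j] is an assumption: ceil((|σ| - q + 1 - pi) / q) happens to reproduce the slide's added values 3, 3, 2 and 1, but the slides do not state this formula in text.

```python
from math import ceil


def choose_lb(pairs, query_len, q=3):
    """pairs: list of (m_ij, p_i) for each matching positional q-gram pair."""
    n_grams = query_len - q + 1
    # Bound used when no q-gram of the query is matched at all.
    no_match = ceil(n_grams / q)
    # Assumed suffix term: q-grams of the query that start after position p_i.
    totals = [m_ij + ceil((n_grams - p_i) / q) for m_ij, p_i in pairs]
    return min([no_match] + totals), totals


# (m[i, j], p_i) for 'Jac', 'ack', 'son', 'ill' in sigma = 'Jacksonville'
pairs = [(0, 1), (0, 2), (2, 5), (2, 9)]
lower_bound, totals = choose_lb(pairs, query_len=12)
```

With these inputs, `totals` holds the per-pair sums listed on the slide and `lower_bound` is their minimum together with the no-match bound.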

Example for TopK-LB

• σ = ‘Jacksen’, |σ| = 7, k = 2

• s1 and s2: dsub(s1, σ) = 1 and dsub(s2, σ) = 4
  • Both are inserted into the max-heap HTopK.

• Next, lo(dsub(s3, σ)) = 2 by DYN-LB.
  • dsub(sR, σ) = 4 and lo(dsub(s3, σ)) < 4, so compute dsub(s3, σ).
  • At the end of this step the max-heap contains {(s1, 1), (s3, 3)}.

• Next, lo(dsub(s4, σ)) = 1.
  • dsub(sR, σ) = 3 and lo(dsub(s4, σ)) < 3, so compute dsub(s4, σ).



Cont…

• Next, lo(dsub(s5, σ)) = 2, which is not less than dsub(sR, σ) = 2.
  • Skip the edit distance computation.

• Next, lo(dsub(s6, σ)) = 1.
  • dsub(sR, σ) = 2 and lo(dsub(s6, σ)) < 2, so compute dsub(s6, σ).

• Summary:
  • We computed the substring edit distance for only 5 of the 6 strings.
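The pruning pattern walked through in this example can be sketched in Python. This is not the DYN-LB bound from the slides: as a stand-in, the sketch uses a looser count-based lower bound (the number of query q-grams that do not occur anywhere in s, divided by q and rounded up), which is an assumption made to keep the example short. All function names are hypothetical.

```python
import heapq
from math import ceil


def dsub(s, query):
    """Minimum edit distance between `query` and any substring of `s`."""
    prev = [0] * (len(s) + 1)
    for i, qc in enumerate(query, 1):
        cur = [i] + [0] * len(s)
        for j, sc in enumerate(s, 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (qc != sc))
        prev = cur
    return min(prev)


def count_lower_bound(s, query, q=3):
    """Loose lower bound on dsub(s, query); a stand-in for DYN-LB."""
    n = len(query) - q + 1
    if n <= 0:
        return 0
    grams_s = {s[i:i + q] for i in range(len(s) - q + 1)}
    c = sum(1 for i in range(n) if query[i:i + q] in grams_s)
    return max(0, ceil((n - c) / q))


def topk_lb(D, query, k, q=3):
    """Return (top-k list, number of dsub computations actually performed)."""
    if k <= 0:
        return [], 0
    heap, computed = [], 0  # heap entries are (-distance, string)
    for s in D:
        # Once the heap is full, skip s if its lower bound cannot beat the root.
        if len(heap) == k and count_lower_bound(s, query, q) >= -heap[0][0]:
            continue
        d = dsub(s, query)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, s))
    return sorted((-neg_d, s) for neg_d, s in heap), computed
```

The second return value plays the role of the "5 of the 6 strings" count in the summary: it reports how many expensive distance computations survived the lower-bound check. A tighter bound such as DYN-LB would prune more strings than this stand-in.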



TopK-Split improves on TopK-LB

• In TopK-LB, we calculate the lower bound for every string s.
• Can we reduce the number of LB computations?
• Split the data set D into DG+ and DG-.
• Calculate the LB only for strings that fall in DG+.
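One way to picture the split is with an inverted q-gram index. The sketch below rests on an assumption that the slides do not spell out: DG+ is taken to be the strings of D that contain at least one q-gram of a chosen set G, and DG- the rest. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(D, q=3):
    """Map each q-gram to the ids of the strings in D that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(D):
        for i in range(len(s) - q + 1):
            index[s[i:i + q]].add(sid)
    return index


def split_dataset(D, G, q=3):
    """Split D into (plus, minus) by whether a string contains any q-gram of G."""
    index = build_qgram_index(D, q)
    plus_ids = set()
    for gram in G:
        plus_ids |= index.get(gram, set())  # index lookups only, no scan of D
    plus = [s for sid, s in enumerate(D) if sid in plus_ids]
    minus = [s for sid, s in enumerate(D) if sid not in plus_ids]
    return plus, minus


plus, minus = split_dataset(['Jack Willson', 'Jacksonville', 'Nashville'], {'ack', 'onv'})
```

Under this reading, membership in the first group is decided by index lookups alone, so strings in the second group never need a per-string lower-bound computation.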

Computing the Best G’

1. Get the inverted index of the positional q-grams in σ

Cont..

2. Return the non-overlapping q-gram set g’ in σ of length τ that has the minimum u[i, τ]

Cont..

• Finally, to select the best G’ with size 2:
  • We choose [8, 2] = {‘ack’, ‘onv’} for G’, since u[8, 2] is the minimum among u[6, 2], u[7, 2] and u[8, 2].

Performance Results

• We have evaluated the performance of TopK-Naïve and TopK-LB by varying k.
• Number of strings = 865361, query length = 7, q-gram length = 3, Java heap space = 4096 MB

Cont..

• We are going to evaluate the performance (execution time) of the three algorithms by varying the following parameters:
  • k
  • Query length
  • Length of q-grams
  • Buffer size
  • Input data set (Wikipedia and DBLP)

• For all the parameters, we expect the execution times to be ordered as:
  • TopK-Naïve > TopK-LB > TopK-Split
