E cient Edit Distance based String Similarity Search using …leser/searchjoincompetition... · E cient Edit Distance based String Similarity Search using Deletion Neighborhoods Shashwat

Efficient Edit Distance based String Similarity Search usingDeletion Neighborhoods

Shashwat Mishra, Tejas Gandhi, Akhil Arora, Arnab Bhattacharya

Special Interest Group in Data,IIT Kanpur

String Similarity Search/Join Workshop,EDBT 2013

March 21, 2013

Introduction

Main Task

• Given a dictionary of strings, perform fast lookups/joins of similarstrings in the dictionary.

• Use edit distance to quantify the notion of similarity betweenstrings.

• Two tracks:

1. Join: Perform self-join of strings. Given a threshold τ , list all stringpairs in the dictionary with edit-distance ≤ τ .

2. Search: Process range queries as specified in a supplied query file.

2 of 24

Introduction

Main Task



• Two tracks:



2 of 24

Introduction

Main Task



• Two tracks:



2 of 24

Introduction

Search Track: Problem Statement

Given a set of strings, D, and a query tuple containing a query string,q, and an edit distance threshold τ , identify all pairs < q, s > s.t.s ∈ D and EditDistance(q, s) ≤ τ .

• Generate and maintain an index structure for the dictionaryrespecting certain time and memory constraints.

• Use the index structure to process a list of queries (2-tuples). Listall answers in the specified format.

• Evaluation Parameter: Minimize total time taken to process allqueries.

3 of 24

Introduction

Search Track: Problem StatementGiven a set of strings, D, and a query tuple containing a query string,q, and an edit distance threshold τ , identify all pairs < q, s > s.t.s ∈ D and EditDistance(q, s) ≤ τ .




3 of 24

Introduction

Search Track: Problem StatementGiven a set of strings, D, and a query tuple containing a query string,q, and an edit distance threshold τ , identify all pairs < q, s > s.t.s ∈ D and EditDistance(q, s) ≤ τ .




3 of 24

Problem Statement: Further Details

• Index construction time not counted towards the score, but mustbe less than 3 hrs.

• Peak memory consumption must be less than 48 GB.

• Score dependent solely on Teffective where

Teffective = tend − tbegin

where tbegin and tend are time instances marking the beginning andthe end of the processing of the query file, respectively.

• Evaluation environment: 8 cores, 64 GB RAM, FC 17.

4 of 24

The Dictionary

• Dictionary 1: Geographicalnames◦ Names of places from

across the globe.◦ Character-set |Σ|=255◦ Mean string length µ=10◦ Dictionary size |D|=400K

• Dictionary 2: Humangenome read data◦ Strings containing human

genomic data.◦ Character-set |Σ|=4◦ Mean string lengthµ=100.3

◦ Dictionary size |D|=750K

0

10

20

30

40

50

5 10 15 20 25 30 35 40 45 50 55

Num

ber

of S

trin

gs (

in 1

03)

String Length (in characters)

Length Distribution

0

50

100

150

200

250

300

350

400

86 88 90 92 94 96 98 100 102 104 106

Num

ber

of

Str

ings (

in 1

03)

String Length (in characters)

Length Distribution

5 of 24

Typical Methodology

• Observation: Modern methods follow a general scheme.

◦ Generate a signature for each string in the dictionary.◦ Maintain an inverted index for generated signatures.◦ Signature must result in a (tight) filtering criteria.◦ Given a query string, q, use the filter and query signature to generate

a candidate list.◦ For each string s ′ in the candidate list, verify if s ′ is answer.

• Signature results in a filtering criteria. ?◦ Signature Function SF (s) maps string s to a signature space.◦ Filtering condition is specific to the choice of the signature function.◦ In general, for string s, query string q, threshold τ , filtering condition

is an inequality.F (s, q, τ, SF ) <> 6= 0

6 of 24

Typical Methodology

• Observation: Modern methods follow a general scheme.◦ Generate a signature for each string in the dictionary.◦ Maintain an inverted index for generated signatures.◦ Signature must result in a (tight) filtering criteria.

◦ Given a query string, q, use the filter and query signature to generatea candidate list.

◦ For each string s ′ in the candidate list, verify if s ′ is answer.



6 of 24

Typical Methodology

• Observation: Modern methods follow a general scheme.◦ Generate a signature for each string in the dictionary.◦ Maintain an inverted index for generated signatures.◦ Signature must result in a (tight) filtering criteria.◦ Given a query string, q, use the filter and query signature to generate




6 of 24

Typical Methodology

• Observation: Modern methods follow a general scheme.◦ Generate a signature for each string in the dictionary.◦ Maintain an inverted index for generated signatures.◦ Signature must result in a (tight) filtering criteria.◦ Given a query string, q, use the filter and query signature to generate




6 of 24

Signature

• Completeness of solution required → No False Dismissals.

• Filter condition must ensure no false dismissals.

• For query < q, τ >, if ED(s, q) ≤ τ then F (s, q, τ, SF ) must hold.

• Computation of signature should be computationallynon-expensive.

• Filter condition should be tight. Resultant candidate list should besmall.

• Popular signature schemes in literature:◦ Q-Gram◦ Deletion Neighborhood

7 of 24

Signature

• Completeness of solution required → No False Dismissals.

• Filter condition must ensure no false dismissals.

• For query < q, τ >, if ED(s, q) ≤ τ then F (s, q, τ, SF ) must hold.

• Computation of signature should be computationallynon-expensive.

• Filter condition should be tight. Resultant candidate list should besmall.

• Popular signature schemes in literature:◦ Q-Gram◦ Deletion Neighborhood

7 of 24

Signature: Deletion Neighborhood

• Defined w.r.t. a specified edit distance threshold.

• For a string s and edit distance τ , deletion neighborhood signatureof s, SFτ (s) is the set of all strings that can be generated bydeleting at-most τ characters from s.

• Ex. s = ”SIGDATA”◦ SF1(s) ={”IGDATA”,”SGDATA”,”SIDATA”,”SIGATA”,”SIGDTA”,

”SIGDAA”,”SIGDAT”, ”SIGDATA”}• Filtering Condition: -

If ED(s1,s2) ≤ τ , then |SFτ (s1) ∩ SFτ (s2)| > 0

• Intuition: Both strings s1, s2 should be reducible to a commonform after at-most τ deletion operations.

8 of 24




• Ex. s = ”SIGDATA”

◦ SF1(s) ={”IGDATA”,”SGDATA”,”SIDATA”,”SIGATA”,”SIGDTA”,”SIGDAA”,”SIGDAT”, ”SIGDATA”}

• Filtering Condition: -



8 of 24





”SIGDAA”,”SIGDAT”, ”SIGDATA”}




8 of 24








8 of 24








8 of 24








8 of 24

Signature: Q-Gram

• For a string s, Q-Gram signature of s, SF (s) is the set of all stringsthat can be generated by taking q contiguous characters from s.

• Ex. s = ”SIGDATA”, Q = 3◦ SF (s) ={”SIG”,”IGD”,”GDA”,”DAT”,”ATA”}


If ED(s1,s2) ≤ τ , then |SF (s1)∩SF (s2)| ≥ max(|s1|, |s2|)+1−(τ+1).Q

• In practice, RHS = |q|+ 1− (τ + 1).Q, where q is the query string.

• For filter to be tight, RHS > 0 → |q| > (τ + 1).Q − 1.◦ Serious limitation ! For Q = 2, τ = 2, only queries with |q| ≥ 6 can

be processed, for τ = 3, |q| ≥ 8.

9 of 24

Signature: Q-Gram








9 of 24

Signature: Q-Gram








9 of 24

Signature: Q-Gram








9 of 24

Signature: Q-Gram








9 of 24

Choice of Signature

• Deletion Neighborhood seems to be better than Q-Gram.◦ No restriction on query size.◦ Every inverted list should (expectedly) be more selective.

• However, they have high space requirement.

• For |s| = 14, τ = 2,◦ |SF3−Gram| = |s| − Q + 1 = 12 O(|s|)◦ |SFDeletion| =

(|s|τ

)= 91 O(|s|τ )

10 of 24

Choice of Signature




(|s|τ

)= 91 O(|s|τ )

10 of 24

Our System

Design Decision

Decided to implement a system following the generic scheme andusing deletion neighborhood as the signature scheme.

11 of 24

Choice of Signature




(|s|τ

)= 91 O(|s|τ )

• Challenge: Reduce space complexity of a deletion neighborhoodsignature based system while maintaining completeness of solution.

• Signature defined w.r.t. a threshold, need dedicated indexstructures, Iτ , for each threshold, τ = [0 : 4].

12 of 24

Choice of Signature




(|s|τ

)= 91 O(|s|τ )

• Challenge: Reduce space complexity of a deletion neighborhoodsignature based system while maintaining completeness of solution.

• Signature defined w.r.t. a threshold, need dedicated indexstructures, Iτ , for each threshold, τ = [0 : 4].

12 of 24

Reducing Space Requirement

• Key Idea: Introduce collisions among objects in SFτ (s).

O1

O2

O3

O4

O ′1

O ′2

O ′3

• Possible Solution ? Hashing !• For O ∈ SFτ (s), h(O) = suffixL(O).◦ Hashing results in a reduced set representation of SFτ (s).◦ Ex. for s =”SIGDATA”, τ = 1, L = 3,

• h(SF (s)) = {”ATA”, ”DTA”, ”DAA”, ”DAT”}• |h(SFτ (s))| = 4 < |SFτ (s)| = 8

• Added Benefit: Need not generate all elements in SFτ (s).|h(SFτ (s))| = O(

(L+ττ

)).

13 of 24



O1

O2

O3

O4

O ′1

O ′2

O ′3




(L+ττ

)).

13 of 24



O1

O2

O3

O4

O ′1

O ′2

O ′3

• Possible Solution ?

Hashing !• For O ∈ SFτ (s), h(O) = suffixL(O).◦ Hashing results in a reduced set representation of SFτ (s).◦ Ex. for s =”SIGDATA”, τ = 1, L = 3,



(L+ττ

)).

13 of 24



O1

O2

O3

O4

O ′1

O ′2

O ′3




(L+ττ

)).

13 of 24

Our System

Design Decision

To implement a system following the generic scheme and usingdeletion neighborhood as the signature.

Reduction of Space Requirement

To hash generated signature and index the resultant strings.

14 of 24

Our System

Design Decision




14 of 24

Hashing: Completeness Guarantee ?

• Any hashing scheme guarantees the completeness of solution.

◦ If s is an answer for a query < q, τ >, then ∃ o s.t. o ∈ SFτ (s) ando ∈ SFτ (q).

◦ Thus, h(o) ∈ h(SFτ (s)), h(SFτ (q)), and hence s is in the candidatelist.

• Why suffix ?◦ No real reason. Initial design decision.◦ Performed well, stuck around.

• Possibly lucrative to try other hash functions/schemes.

15 of 24


• Any hashing scheme guarantees the completeness of solution.◦ If s is an answer for a query < q, τ >, then ∃ o s.t. o ∈ SFτ (s) and

o ∈ SFτ (q).◦ Thus, h(o) ∈ h(SFτ (s)), h(SFτ (q)), and hence s is in the candidate

list.



15 of 24




list.

• Why suffix ?

◦ No real reason. Initial design decision.◦ Performed well, stuck around.


15 of 24




list.



15 of 24

Hashing: Layout

• So what does Iτ look like ?

• Hash Table !

h1 h2 h|HK |

i1i2

ip

.....

◦ Keys consist of string resulting from hash (suffix operation).◦ List consists of ids of strings in D that generate the key.

16 of 24

Hashing: Layout


• Hash Table !

h1 h2 h|HK |

i1i2

ip

.....


16 of 24

Hashing: Layout


• Hash Table !

h1 h2 h|HK |

i1i2

ip

.....


16 of 24

Bucketing

• Should Iτ be a single structure (hash-table) for all strings ?

• No reason why !• Infact it helps to partitions strings on the basis of length.◦ Iτ consists of buckets.◦ Every bucket responsible for indexing strings within a particular length

range.

Iτ

B1 B2 B3

3 10

7 16

13 25

2τ 2τ

17 of 24

Bucketing



range.

Iτ

B1 B2 B3

3 10

7 16

13 25

2τ 2τ

17 of 24

Bucketing


• No reason why !• Infact it helps to partitions strings on the basis of length.

◦ Iτ consists of buckets.◦ Every bucket responsible for indexing strings within a particular length

range.

Iτ

B1 B2 B3

3 10

7 16

13 25

2τ 2τ

17 of 24

Bucketing



range.

Iτ

B1 B2 B3

3 10

7 16

13 25

2τ 2τ

17 of 24

Bucketing: Why ?

• Ex. Let s =”PLACATING”, |s| = 9.

• Query < BATING , 2 >. |q| = 6.

• For I2, if L = 4, s will be in the candidate list.◦ ”TING” common hash-key.

• abs(|s| − |q|) > τ . s is ruled out.

• Apply length filter higher up in the pipeline.

• Say Iτ had B1 s.t. the bucket was responsible for range [1 : 8].◦ Query could be answered by B1 alone.◦ B1 would not index ”PLACATING”.

• 25% reduction in average search time for τ = 2.150µs → 110µs .

18 of 24

Bucketing: Why ?


• Query < BATING , 2 >. |q| = 6.






18 of 24

Bucketing: Why ?


• Query < BATING , 2 >. |q| = 6.






18 of 24

Bucketing: Why ?


• Query < BATING , 2 >. |q| = 6.






18 of 24

Our System

Design Decision




Bucketing: Reducing collisions in Hash-Table

Partition strings on basis of length. Apply early length filtering.Results in smaller individual hash-tables. Increases space requirement.

19 of 24

Our System

Design Decision




Bucketing: Reducing collisions in Hash-Table

Partition strings on basis of length. Apply early length filtering.Results in smaller individual hash-tables. Increases space requirement.

19 of 24

Verification

• How to verify if s ∈ Candidate-List is an answer ? Check ifED(s, q) ≤ τ

• Naive Method: Explicitly compute ED(s, q).

• Not interested in value of ED(s, q).

S1

S2

• For each row i , only need to examine j s.t.i −

⌊τ−δ2

⌋< j < i +

⌊τ+δ2

⌋. δ = abs(|s1| − |s2|).

• Li et al., VLDB 2012.

20 of 24

Verification




S1

S2


⌊τ−δ2

⌋< j < i +

⌊τ+δ2

⌋. δ = abs(|s1| − |s2|).

• Li et al., VLDB 2012.

20 of 24

Verification




S1

S2


⌊τ−δ2

⌋< j < i +

⌊τ+δ2

⌋. δ = abs(|s1| − |s2|).

• Li et al., VLDB 2012.20 of 24

Execution

• Generate dedicated index structures, Iτ for each τ .

• Read and group all queries depending on threshold τ .

• Sequentially process queries for each τ = [0 : 4].◦ Multiple threads read their own queries.◦ Thread j responsible for all queries s.t. idq % 8 = j .◦ Result of each query written to a global buffer in memory.

Contention between threads on memory write.

• Flush the buffer to the disk.

21 of 24

Results

• So how do we fare against a state-of-the-art ?

0.001

0.01

0.1

1

10

100

0 1 2 3 4

Avg. Q

uery

Tim

e (

in m

sec)

Edit Distance τ

Average Query Time vs Edit Distance

Flamingo

Proposed

τ Avg. query time (ms)Flamingo Proposed

0 0.015 0.0041 0.146 0.0192 0.901 0.1083 7.245 0.7364 30.906 4.801

22 of 24

Results


0.001

0.01

0.1

1

10

100

0 1 2 3 4

Avg. Q

uery

Tim

e (

in m

sec)

Edit Distance τ


Flamingo

Proposed


0 0.015 0.0041 0.146 0.0192 0.901 0.1083 7.245 0.7364 30.906 4.801

22 of 24

Results


0.001

0.01

0.1

1

10

100

0 1 2 3 4

Avg. Q

uery

Tim

e (

in m

sec)

Edit Distance τ


Flamingo

Proposed


0 0.015 0.0041 0.146 0.0192 0.901 0.1083 7.245 0.7364 30.906 4.801

22 of 24

The Team

23 of 24

The Team

23 of 24

The Team

23 of 24

The Team

23 of 24

The Team

23 of 24

Thank You !

24 of 24

Documents

E cient Edit Distance based String Similarity Search using …leser/searchjoincompetition... · E cient Edit Distance based String Similarity Search using Deletion Neighborhoods Shashwat