Efficient Exact Set-Similarity Joins
Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
DMX Group, Microsoft Research


Sept. 15, 2006 Set-Similarity Joins

Data Cleaning

Name | Street | City | State | Zip
INGRAM MICRO | 1600 ST ANDREWS PL | SANTA ANA | CA | 92799
GTE CORP | 1 STAMFORD FORUM | STAMFORD | CT |
LOGISOFT | 274 GOODMAN ST N | ROCHESTER | | 14607
CIEDC | 1800 5TH ST | LINCONL | IL | 92799
INGRAM MCRO | 1600 ST ANDREW’S PL | SANTA ANA | CA | 92799

Data Cleaning

Name | Street | City | State | Zip
INGRAM MICRO | 1600 ST ANDREWS PL | SANTA ANA | CA | 92799
GTE CORP | 1 STAMFORD FORUM | STAMFORD | CT | 06901
LOGISOFT | 274 GOODMAN ST N | ROCHESTER | NY | 14607
CIEDC | 1800 5TH ST | LINCOLN | IL | 92799

String Similarity Join

Reference Table (CITY): ALABASTER, ALBERTVILLE, …, LINCOLN, …, YUCAIPA

Input table (City column): …, LINCONL, …

String Similarity (Self) Join

Name | Street | City | State | Zip
INGRAM MICRO | 1600 ST ANDREWS PL | SANTA ANA | CA | 92799
GTE CORP | 1 STAMFORD FORUM | STAMFORD | CT |
LOGISOFT | 274 GOODMAN ST N | ROCHESTER | | 14607
CIEDC | 1800 5TH ST | LINCONL | IL | 92799
INGRAM MCRO | 1600 ST ANDREW’S PL | SANTA ANA | CA | 92799

Strings → Sets [CGK ’06]

2-grams:
    microsoft → { mi, ic, cr, ro, os, so, of, ft }
    mcrosoft → { mc, cr, ro, os, so, of, ft }

(edit distance ≤ 1) → (Δ ≤ 4)

[Diagram: string tables R and S (containing, e.g., microsoft… and mcrosoft…). The String Sim Join with edit distance ≤ 1 is evaluated in three steps: Tokenize (Strings → Sets), Set Sim Join with Δ ≤ 4 over the resulting sets, then Post-Process.]
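The tokenization step can be illustrated with a minimal Python sketch. It assumes plain character 2-grams without padding, which matches the two example sets on the slide; the function name `two_grams` is hypothetical and not taken from the talk.

```python
def two_grams(text: str) -> set[str]:
    """Return the set of 2-grams (adjacent character pairs) of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

r = two_grams("microsoft")
s = two_grams("mcrosoft")

# Symmetric difference: 2-grams that occur in exactly one of the two sets.
delta = r ^ s
print(sorted(delta))    # ['ic', 'mc', 'mi']
print(len(delta) <= 4)  # True
```

A single character edit touches at most two 2-grams in each string, which is why edit distance ≤ 1 implies Δ ≤ 4. The converse does not hold, so the set join can return extra pairs that the Post-Process step has to check.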

String → Set: Advantages

Generalizes to many string similarity funcs
    Powerful primitive
Sets ≈ Relations
    Leverage relational data processing
    [CGK ’06]

Contributions

New algorithms for set-similarity joins
    Exact answers
    Performance guarantees
    Outperform previous exact algorithms
        Orders of magnitude
Exact answers are important for operators

Outline

Introduction
Algorithms
Experiments
Conclusion

[Figure, shown over several builds: two large collections of sets, R = r1, r2, r3, …, rn and S = s1, s2, s3, …, sm. Example sets include { mi, ic, cr, ro, os, so, of, ft }, { lo, og, gi, is, so, of, ft }, { bo, oe, ei, in, ng }, { mc, cr, ro, os, so, of, ft }, and { lg, gi, is, so, of, ft }.

With the predicate "Intersection size ≥ 5", the join pairs { mc, cr, ro, os, so, of, ft } with { mi, ic, cr, ro, os, so, of, ft }, and { lg, gi, is, so, of, ft } with { lo, og, gi, is, so, of, ft }.

In general, the join predicate is Sim (ri, sj) ≥ θ.]

Set-Similarity Join: Symmetric Difference

Input:
    R: r1, r2, …, rn (n sets)
    S: s1, s2, …, sm (m sets)
Output: All pairs (ri, sj) such that |ri Δ sj| ≤ k

Running example: k = 4
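As a point of reference, the problem statement can be written down directly as a nested-loop join. This is only a baseline sketch that compares all n × m pairs, not one of the algorithms from the talk; `naive_ssjoin` is a hypothetical name, and the assignment of the example sets to R and S is an assumption.

```python
from itertools import product

def naive_ssjoin(R, S, k):
    """Return all index pairs (i, j) with |R[i] Δ S[j]| <= k by testing every pair."""
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(R), enumerate(S))
        if len(r ^ s) <= k
    ]

R = [
    {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"},
    {"lo", "og", "gi", "is", "so", "of", "ft"},
    {"bo", "oe", "ei", "in", "ng"},
]
S = [
    {"mc", "cr", "ro", "os", "so", "of", "ft"},
    {"lg", "gi", "is", "so", "of", "ft"},
]
print(naive_ssjoin(R, S, k=4))  # [(0, 0), (1, 1)]
```

The quadratic number of comparisons is what makes this impractical for large inputs and motivates signature-based candidate generation.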

Alternate Set Representation

s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }

[Figure, shown over several builds: s drawn along an axis with ticks at 1, 25, and 50.]
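One way to read the axis is as a bit-vector view of the set, with one position per element of the domain. The sketch below assumes the domain is the integers 1 to 50, taken from the axis ticks; the helper names and the neighboring set `r` are hypothetical.

```python
DOMAIN = 50  # assumed domain size, read off the axis ticks 1, 25, 50

def to_bits(elements, n=DOMAIN):
    """Encode a set over 1..n as a string of n bits, one per domain element."""
    return "".join("1" if e in elements else "0" for e in range(1, n + 1))

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {13}) | {14}  # hypothetical set that differs from s in two elements

bits_s, bits_r = to_bits(s), to_bits(r)
print(bits_s)
print(hamming(bits_s, bits_r) == len(r ^ s))  # True
```

In this view the symmetric difference of two sets is the Hamming distance between their bit vectors, so each differing element shows up as one differing position.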

Enumeration

[Figure, shown over several builds: sets r and s with |r Δ s| ≤ 4. The positions where they differ are marked "Errors", and the domain is divided into five partitions labeled 1 to 5.]

Enumeration: Signature Generation

[Figure: Sig (s) is built from s as a set of five signatures, which Hash32() maps to the values below.]

Sig (s) = { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }

Property of Signatures

|r Δ s| ≤ 4 ⇒ Sig (r) ∩ Sig (s) ≠ ∅

[Figure: r and s over partitions 1 to 5.]

Enumeration: Algorithm

Generate signatures for each ri, sj
Enumerate (ri, sj) s.t. Sig (ri) ∩ Sig (sj) ≠ ∅
Output those satisfying |ri Δ sj| ≤ 4
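The three steps above can be sketched as follows. The signature function here is an assumption based on the five-partition figures: the domain 1..50 is cut into k + 1 = 5 contiguous partitions, and each signature is one partition index together with the elements of the set that fall into it. The slides hash each signature with Hash32(); the sketch keeps the raw tuples so the logic stays visible. All names are hypothetical.

```python
from collections import defaultdict

K = 4            # running example: |r Δ s| <= 4
DOMAIN = 50      # assumed element domain 1..50
N_PARTS = K + 1  # five partitions, as in the figures

def part_of(e):
    """Map element e in 1..DOMAIN to one of N_PARTS contiguous partitions."""
    return (e - 1) * N_PARTS // DOMAIN

def signatures(s):
    """One signature per partition: (partition index, elements of s inside it)."""
    proj = [[] for _ in range(N_PARTS)]
    for e in sorted(s):
        proj[part_of(e)].append(e)
    return {(i, tuple(p)) for i, p in enumerate(proj)}

def ssjoin(R, S):
    """Signature-based join: generate candidates, then verify each one."""
    index = defaultdict(set)                 # signature -> ids of sets in S
    for j, s in enumerate(S):
        for sig in signatures(s):
            index[sig].add(j)
    candidates = set()
    for i, r in enumerate(R):
        for sig in signatures(r):
            for j in index.get(sig, ()):
                candidates.add((i, j))
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= K)

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
R = [(s - {4, 10}) | {5, 11}, {1, 2, 3}]
S = [s, {20, 30, 40}]
print(ssjoin(R, S))  # [(0, 0)]
```

If two sets differ in at most four elements, at most four of the five partitions can contain a difference, so at least one partition is identical in both and produces a shared signature. In the example, {1, 2, 3} and {20, 30, 40} share the empty last partition and become a candidate pair, which the final check then discards.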

Enumeration

[Figure, shown over several builds: sets r1 to r5 and s1 to s5, each with its signature set Sig (ri) or Sig (sj). Pairs whose signature sets intersect, for example Sig (r2) ∩ Sig (s1) ≠ ∅, become candidate pairs. Candidate pairs are labeled either "Output" or "False positive candidate pairs".]

[Diagram: query plan. R (Id, Elem) and S (Id, Elem) → Gen Signatures → R’ (Id, Sig) and S’ (Id, Sig) → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id.]
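The candidate-generation part of this plan maps onto an ordinary equi-join followed by duplicate elimination. The sketch below runs it in an in-memory SQLite database through Python's `sqlite3` module; the table names `Rsig` and `Ssig` stand in for R’ and S’, and the signature values are made-up placeholders rather than output of any real signature function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rsig (Id INTEGER, Sig TEXT);  -- stands in for R' (Id, Sig)
    CREATE TABLE Ssig (Id INTEGER, Sig TEXT);  -- stands in for S' (Id, Sig)
""")
# Placeholder signatures: set 1 in R shares two signatures with set 7 in S.
conn.executemany("INSERT INTO Rsig VALUES (?, ?)", [(1, "a"), (1, "b"), (2, "c")])
conn.executemany("INSERT INTO Ssig VALUES (?, ?)", [(7, "a"), (7, "b"), (8, "d")])

# Join on equal signatures, then keep each (R.Id, S.Id) pair once.
candidates = conn.execute("""
    SELECT DISTINCT Rsig.Id, Ssig.Id
    FROM Rsig JOIN Ssig ON Rsig.Sig = Ssig.Sig
    ORDER BY Rsig.Id, Ssig.Id
""").fetchall()
print(candidates)  # [(1, 7)]
```

Without DISTINCT the pair (1, 7) would appear twice, once per shared signature. Each surviving pair would then go to the post-processing step, which checks the actual join predicate on the original sets.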

No False Positive Candidate Pair

[Figure: r and s over partitions 1 to 5, with |r Δ s| = 5.]

False Positive Candidate Pair

[Figure: sets s1 and s2 over partitions 1 to 5, with |r Δ s| = 5.]

Enumeration: Performance

[Plot: probability of a common signature (y-axis, 0 to 1) versus symmetric difference (x-axis, 1 to 20) for k = 4. A second build adds the "Ideal Performance" curve.]

Enumeration

[Figure: r and s with |r Δ s| ≤ 4, now over six partitions labeled 1 to 6.]

Enumeration: Signature Generation

[Figure, shown over several builds: signatures of s1 over partitions 1 to 6.]

Number of signatures: (6 choose 2) = 15

Algorithm

Generate signatures for each ri, sj
Enumerate (ri, sj) s.t. Sig (ri) ∩ Sig (sj) ≠ ∅
Output those satisfying |ri Δ sj| ≤ 4

Only the signature function changes
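A possible form of the changed signature function is sketched below. It is an assumption consistent with the six-partition figure and its count of 15 signatures: the domain is cut into n = 6 partitions, and every choice of n − k = 2 partitions yields one signature made of the chosen indices and the set's elements inside them. The contiguous partitioning of 1..50 and all names are hypothetical.

```python
from itertools import combinations

K = 4        # running example: |r Δ s| <= 4
DOMAIN = 50  # assumed element domain 1..50
N = 6        # number of partitions, as in the six-partition figure

def part_of(e):
    """Map element e in 1..DOMAIN to one of N contiguous partitions."""
    return (e - 1) * N // DOMAIN

def enum_signatures(s):
    """One signature per choice of N - K partitions: the chosen partition
    indices plus the elements of s that fall into each of them."""
    proj = [tuple(sorted(e for e in s if part_of(e) == p)) for p in range(N)]
    return {
        (chosen, tuple(proj[p] for p in chosen))
        for chosen in combinations(range(N), N - K)
    }

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
print(len(enum_signatures(s)))  # 15
```

When two sets differ in at most four elements, at least two of the six partitions are identical in both, so the signature for that pair of partitions is shared. Compared with one partition per signature, each signature now covers a larger part of the set, at the cost of more signatures per set.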

Enumeration: Performance

[Plot: probability of a common signature (y-axis, 0 to 1) versus symmetric difference (x-axis, 1 to 20) for k = 4, with curves for n1 = 5 and n1 = 6.]

False Positive Candidate Pair

[Figure: r and s over partitions 1 to 6, with |r Δ s| = 5.]

Enumeration: Performance

[Plot: same axes, k = 4, with curves for n1 = 5, n1 = 6, n1 = 7, and n1 = 20. A second build annotates the curves with 5, 15, 35, and 4845.]
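The annotations 5, 15, 35, and 4845 match the number of ways to choose n1 − k of n1 partitions for k = 4. Reading them as signatures per set is an assumption, but the arithmetic itself is easy to check:

```python
from math import comb

K = 4  # running example
for n1 in (5, 6, 7, 20):
    # Number of ways to choose n1 - K partitions out of n1.
    print(n1, comb(n1, n1 - K))
# 5 5
# 6 15
# 7 35
# 20 4845
```

Under that reading, adding partitions lowers the chance that dissimilar sets share a signature, while the number of signatures per set grows quickly.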

PartEnum: Divide and Conquer

[Figure: s1 split into two partitions, 1 and 2, for k = 4, labeled k1 = 2 and k2 = 1.]

Generate signatures using Enumeration
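One reading of this slide is a two-level scheme: split the domain into two halves with thresholds k1 = 2 and k2 = 1, then generate Enumeration signatures inside each half using that half's threshold. The sketch below follows that reading and is not taken from the authors' implementation. The domain 1..50, the choice of four second-level partitions per half (`N2`), and all names are assumptions.

```python
from itertools import combinations

DOMAIN = 50          # assumed element domain 1..50
THRESHOLDS = (2, 1)  # k1, k2 as labeled on the slide (k = 4)
N2 = 4               # assumed number of second-level partitions per half

def partenum_signatures(s):
    """Enumeration signatures generated separately inside each first-level part."""
    width = DOMAIN // len(THRESHOLDS)  # elements per first-level part
    sigs = set()
    for i, k_i in enumerate(THRESHOLDS):
        # Project s onto the N2 second-level partitions of first-level part i.
        proj = [[] for _ in range(N2)]
        for e in sorted(s):
            offset = e - 1 - i * width
            if 0 <= offset < width:
                proj[offset * N2 // width].append(e)
        # One signature per choice of N2 - k_i second-level partitions.
        for chosen in combinations(range(N2), N2 - k_i):
            sigs.add((i, chosen, tuple(tuple(proj[p]) for p in chosen)))
    return sigs

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
print(len(partenum_signatures(s)))  # 10
```

If |r Δ s| ≤ 4, the two halves cannot both exceed their thresholds, since that would need at least 3 + 2 = 5 differences. Within the half that stays under its threshold, enough second-level partitions are identical for one signature to match. With these parameters each set gets 6 + 4 = 10 signatures.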

PartEnum: Asymptotic Performance

Theorem: There is an instance of PartEnum such that:
    If |r Δ s| > 7.5 k, then r and s do not share a signature with probability 1 - o(1)
    The number of signatures per set: O (k²)

PartEnum: Summary

Set-Similarity Joins with predicate |r Δ s| ≤ k
    Theoretical guarantees
    First exact algorithm

Other results

PartEnum extensions:
    Larger class of set-similarity join predicates
        Jaccard
        Basic idea: reduce to symmetric set difference
WtEnum class of signature functions:
    Use frequency of elements
    Weighted set-similarity joins
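The slide does not spell out the Jaccard reduction. One standard identity that supports it is: Jaccard (r, s) ≥ γ holds exactly when |r Δ s| ≤ (1 − γ) / (1 + γ) · (|r| + |s|). The sketch below checks this on the running example with exact fractions; the function names are hypothetical, and this is not claimed to be the authors' construction.

```python
from fractions import Fraction

def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| as an exact fraction."""
    return Fraction(len(r & s), len(r | s))

def symdiff_threshold(r, s, gamma):
    """Bound such that Jaccard(r, s) >= gamma iff |r Δ s| <= bound."""
    return (1 - gamma) / (1 + gamma) * (len(r) + len(s))

r = {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"}
s = {"mc", "cr", "ro", "os", "so", "of", "ft"}

for gamma in (Fraction(3, 5), Fraction(4, 5)):
    same = (jaccard(r, s) >= gamma) == (len(r ^ s) <= symdiff_threshold(r, s, gamma))
    print(gamma, same)

print(jaccard(r, s), len(r ^ s))  # 2/3 3
```

The bound depends on the sizes of the two sets, so a Jaccard threshold corresponds to different symmetric-difference thresholds for different pairs.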

Outline

Introduction
Algorithms
Experiments
Conclusion

Implementation

[Diagram: query plan. R (Id, Elem) and S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id. Stages are annotated as running in the DBMS, in the Client, or in Client + DBMS.]

Previous Work

Prefix Filtering [CGK ’06]
    Exact
Locality Sensitive Hashing [IM ’98]
    Approximate
    False negative rate: 5%

Data Sets

Organization addresses [MS Sales]
    Concatenation: Org name, street, city, zip
    Input size: 1 million
    Avg. length: 11 words, 58 chars
    Tokenization: Words, n-grams

Jaccard, 1M, MS Sales

[Bar chart: running time in seconds (0 to 4000) for PEN, LSH, and PF at thresholds 0.8, 0.85, and 0.9, broken down into SigGen, CandPair, and PostFilter.]

Evaluation

[Diagram: query plan. R (Id, Elem) and S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id. Stages are annotated as DBMS, Client, or Client + DBMS, and the intermediate result size is marked.]

Jaccard, 1M, MS Sales

[Bar chart: intermediate result size (0 to 2.5E+08) for PEN, LSH, and PF at thresholds 0.8, 0.85, and 0.9.]

Jaccard, Synthetic

[Plot: intermediate result size (1.0E+03 to 1.0E+11) versus input size (1.0E+03 to 1.0E+09), both axes in powers of ten, for LSH(0.95), PEN, and PF.]

Similar Results for …

Other data sets
    DBLP, Synthetic data sets
Other similarity functions
    Weighted Jaccard
    Edit distance

Conclusion

New algorithms for set-similarity joins
    Exact
    Performance guarantees
    Outperform previous exact algorithms

Search: “data cleaning project”