Efficient Exact Set-Similarity Joins
Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
DMX Group, Microsoft Research
Sept. 15, 2006

Data Cleaning
Name          Street               City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL   SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM     STAMFORD   CT
LOGISOFT      274 GOODMAN ST N     ROCHESTER         14607
CIEDC         1800 5TH ST          LINCONL    IL     92799
INGRAM MCRO   1600 ST ANDREW’S PL  SANTA ANA  CA     92799

Data Cleaning (after cleaning)
Name          Street              City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL  SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM    STAMFORD   CT     06901
LOGISOFT      274 GOODMAN ST N    ROCHESTER  NY     14607
CIEDC         1800 5TH ST         LINCOLN    IL     92799

String Similarity Join
Reference Table (CITY): ALABASTER, ALBERTVILLE, …, LINCOLN, …, YUCAIPA
[Figure: a table whose City column contains LINCONL is joined against the Reference Table]

String Similarity (Self) Join
Name          Street               City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL   SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM     STAMFORD   CT
LOGISOFT      274 GOODMAN ST N     ROCHESTER         14607
CIEDC         1800 5TH ST          LINCONL    IL     92799
INGRAM MCRO   1600 ST ANDREW’S PL  SANTA ANA  CA     92799

Strings → Sets [CGK ’06]
2-grams:
microsoft → { mi, ic, cr, ro, os, so, of, ft }
mcrosoft  → { mc, cr, ro, os, so, of, ft }
(edit distance ≤ 1) ----> (Δ ≤ 4)
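The string-to-set step on this slide can be reproduced in a few lines. This is a minimal sketch, assuming plain contiguous 2-grams with no padding characters; the helper name `qgrams` is made up for illustration.

```python
def qgrams(text: str, q: int = 2) -> set[str]:
    """Return the set of contiguous q-character substrings of text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


r = qgrams("microsoft")
s = qgrams("mcrosoft")
delta = r ^ s  # symmetric difference: grams present in exactly one set

print(sorted(delta))  # ['ic', 'mc', 'mi']
print(len(delta))     # 3
```

For this pair the symmetric difference has size 3, which is within the Δ ≤ 4 bound stated on the slide for edit distance ≤ 1.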

String Sim Join (edit distance ≤ 1)
[Figure: string tables R and S (e.g. mcrosoft, microsoft) are joined in three steps: Tokenize (Strings → Sets), Set Sim Join (Δ ≤ 4) on the tokenized R and S, then Post-Process (Sets → Strings)]

String → Set: Advantages
Generalizes to many string similarity funcs
  Powerful primitive
Sets ≈ Relations
  Leverage relational data processing [CGK ’06]

Contributions
New algorithms for set-similarity joins
  Exact answers
  Performance guarantees
  Outperform previous exact algorithms
    Orders of magnitude
Exact answers are important for operators

Outline
Introduction
Algorithms
Experiments
Conclusion

[Figure: set-similarity join between two large collections of sets, R and S]
  { mi, ic, cr, ro, os, so, of, ft }      { mc, cr, ro, os, so, of, ft }
  { lo, og, gi, is, so, of, ft }          { lg, gi, is, so, of, ft }
  { … }                                   { … }
  { bo, oe, ei, in, ng }                  { … }
Predicate: Intersection size ≥ 5
Matching pairs:
  { mc, cr, ro, os, so, of, ft }   { mi, ic, cr, ro, os, so, of, ft }
  { lg, gi, is, so, of, ft }       { lo, og, gi, is, so, of, ft }
In general: Sim(r_i, s_j) ≥ θ over R = r_1, r_2, r_3, …, r_n and S = s_1, s_2, s_3, …, s_m

Set-Similarity Join: Symmetric Difference
Input:
  R: r_1, r_2, …, r_n (n sets)
  S: s_1, s_2, …, s_m (m sets)
Output: All pairs (r_i, s_j) such that |r_i Δ s_j| ≤ k
Running example: k = 4
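As a reference point for the problem statement, the join can be written as a brute-force comparison of every pair. This is only a baseline sketch to pin down the semantics, not one of the algorithms from the talk; the function name `naive_ssjoin` and the sample collections are illustrative.

```python
from itertools import product


def naive_ssjoin(R, S, k):
    """Return index pairs (i, j) with |R[i] Δ S[j]| <= k by comparing all pairs."""
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(R), enumerate(S))
        if len(r ^ s) <= k
    ]


# Sample sets taken from the 2-gram examples on the earlier slides.
R = [
    {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"},
    {"lo", "og", "gi", "is", "so", "of", "ft"},
    {"bo", "oe", "ei", "in", "ng"},
]
S = [
    {"mc", "cr", "ro", "os", "so", "of", "ft"},
    {"lg", "gi", "is", "so", "of", "ft"},
]

print(naive_ssjoin(R, S, k=4))  # [(0, 0), (1, 1)]
```

The cost is proportional to n × m set comparisons, which is what signature-based schemes try to avoid on large inputs.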

Alternate Set Representation
s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }
[Figure: s shown over the element domain, axis marked 1, 25, 50]
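One plausible reading of this slide, assumed here since only the axis labels are recoverable, is that a set over the domain 1..50 is viewed as a 50-position bit vector. Under that view the symmetric difference of two sets equals the Hamming distance of their vectors. The helper names and the second set `r` are hypothetical.

```python
def to_bits(s, domain_size=50):
    """Encode a set of integers from 1..domain_size as an int used as a bit vector."""
    bits = 0
    for e in s:
        if not 1 <= e <= domain_size:
            raise ValueError(f"{e} is outside 1..{domain_size}")
        bits |= 1 << (e - 1)
    return bits


def hamming(a, b):
    """Number of positions in which two bit vectors differ."""
    return bin(a ^ b).count("1")


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {13}) | {14, 50}  # hypothetical second set for illustration

print(hamming(to_bits(r), to_bits(s)))  # 3
print(len(r ^ s))                       # 3
```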

Enumeration
|r Δ s| ≤ 4
[Figure: r and s drawn one above the other; the positions where they differ are marked "Errors"; the element domain is divided into partitions numbered 1 to 5]

Enumeration: Signature Generation
[Figure: Sig(s) shown as a set of five entries derived from s]
Hash32() turns these into:
Sig(s) = { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }

Property of Signatures
|r Δ s| ≤ 4  ⇒  Sig(r) ∩ Sig(s) ≠ Φ
[Figure: r and s over partitions 1 to 5]
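The property follows from a pigeonhole argument: with k = 4 differing elements spread over 5 partitions, at least one partition of r and s must be identical. The sketch below assumes equal-width partitions of the domain 1..50 and uses `zlib.crc32` as a stand-in for the slide's Hash32(); both choices, and the function names, are assumptions made for illustration.

```python
import zlib


def partition_of(elem, n_parts, domain_size):
    """0-based index of the equal-width partition containing elem (elements are 1-based)."""
    return (elem - 1) * n_parts // domain_size


def signatures(s, k, domain_size=50):
    """One 32-bit signature per partition, using k + 1 partitions."""
    n_parts = k + 1
    parts = [[] for _ in range(n_parts)]
    for e in sorted(s):
        parts[partition_of(e, n_parts, domain_size)].append(e)
    # The partition index is part of the hashed text so equal contents in
    # different partitions do not produce the same signature.
    return {zlib.crc32(f"{p}:{elems}".encode()) for p, elems in enumerate(parts)}


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {13}) | {14, 50}  # hypothetical set with |r Δ s| = 3

shared = signatures(r, k=4) & signatures(s, k=4)
print(len(shared) > 0)  # True
```

Two sets that are both empty in some partition also share that partition's signature, which is one source of the false positive candidates discussed later in the deck.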

Enumeration: Algorithm
Generate signatures for each r_i, s_j
Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ Φ
Output those satisfying |r_i Δ s_j| ≤ 4
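The three steps can be sketched as a generic join that takes the signature function as a parameter. This is an in-memory illustration under stated assumptions: `sig_partition` is a simplified, unhashed partition signature over the domain 1..50, and the index is a plain dictionary rather than anything from the talk's implementation.

```python
from collections import defaultdict


def sig_partition(s, k, domain_size=50):
    """Hypothetical signature function: one signature per equal-width partition (k + 1 partitions)."""
    n = k + 1
    parts = [[] for _ in range(n)]
    for e in sorted(s):
        parts[(e - 1) * n // domain_size].append(e)
    return {(p, tuple(elems)) for p, elems in enumerate(parts)}


def signature_join(R, S, k, sig):
    """Generate signatures, enumerate pairs sharing one, then verify each candidate."""
    index = defaultdict(set)              # signature -> ids of sets in S
    for j, s in enumerate(S):
        for g in sig(s):
            index[g].add(j)
    candidates = {
        (i, j) for i, r in enumerate(R) for g in sig(r) for j in index.get(g, ())
    }
    # Post-filter: keep only candidates that satisfy the real predicate.
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= k)


R = [{4, 10, 13, 24, 29, 35, 41, 46, 48}, {1, 2, 3}]
S = [{4, 10, 14, 24, 29, 35, 41, 46, 48, 50}, {5, 15, 25, 35, 45}]

print(signature_join(R, S, k=4, sig=lambda x: sig_partition(x, k=4)))  # [(0, 0)]
```

Because the post-filter checks the actual predicate, a weaker signature function only costs extra candidates, never wrong answers.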

Enumeration
[Figure: sets r1 … r5 and s1 … s5, each with its signature set Sig(r1) … Sig(r5), Sig(s1) … Sig(s5). Pairs whose signature sets intersect become candidates, e.g. Sig(r2) ∩ Sig(s1) ≠ Φ. The final build splits candidates into "Output" and "False positive candidate pairs".]

[Figure: relational query plan]
R (Id, Elem), S (Id, Elem)
  → Gen Signatures → R’ (Id, Sig), S’ (Id, Sig)
  → join on R.Sig = S.Sig
  → δ R.Id, S.Id
  → Post-Process each R.Id, S.Id
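The candidate-generation part of this plan maps directly onto an equi-join followed by duplicate elimination. The sketch below runs it on an in-memory SQLite database through Python's standard `sqlite3` module; the table names `R_sig` and `S_sig` stand in for R’ and S’, and the signature values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE R_sig (Id INTEGER, Sig INTEGER);
    CREATE TABLE S_sig (Id INTEGER, Sig INTEGER);
    """
)
# Made-up signature rows: set 1 of R shares two signatures with set 1 of S.
conn.executemany("INSERT INTO R_sig VALUES (?, ?)", [(1, 11), (1, 12), (2, 13)])
conn.executemany("INSERT INTO S_sig VALUES (?, ?)", [(1, 11), (1, 12), (2, 99)])

# Equi-join on Sig, then DISTINCT plays the role of the δ operator.
candidates = conn.execute(
    """
    SELECT DISTINCT R_sig.Id, S_sig.Id
    FROM R_sig JOIN S_sig ON R_sig.Sig = S_sig.Sig
    ORDER BY R_sig.Id, S_sig.Id
    """
).fetchall()

print(candidates)  # [(1, 1)]
```

Without DISTINCT the pair (1, 1) would appear twice, once per shared signature, and would be post-processed twice.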

No False Positive Candidate Pair
[Figure: r and s over partitions 1 to 5, |r Δ s| = 5]

False Positive Candidate Pair
[Figure: sets labeled s1 and s2 over partitions 1 to 5, |r Δ s| = 5]

Enumeration: Performance
[Plot: Probability of Common Signature (y-axis, 0 to 1) against Symmetric Difference (x-axis, 1 to 20), k = 4; a second build adds an "Ideal Performance" curve]

Enumeration
|r Δ s| ≤ 4
[Figure: r and s with the element domain divided into partitions numbered 1 to 6]

Enumeration: Signature Generation
[Figure: s1 over partitions 1 to 6, shown over several builds]
(6 choose 2) = 15
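The count of 15 is consistent with building one signature from every choice of n1 − k = 2 partitions out of n1 = 6: if at most k = 4 partitions contain a difference, at least 2 agree, so some chosen pair is identical in both sets. The sketch below assumes equal-width partitions over 1..50 and unhashed tuple signatures; `enum_signatures` and the set `r` are illustrative.

```python
from itertools import combinations
from math import comb


def enum_signatures(s, k, n1, domain_size=50):
    """One signature per subset of n1 - k partitions (out of n1 equal-width partitions)."""
    if n1 <= k:
        raise ValueError("n1 must exceed k")
    parts = [[] for _ in range(n1)]
    for e in sorted(s):
        parts[(e - 1) * n1 // domain_size].append(e)
    return {
        (chosen, tuple(tuple(parts[p]) for p in chosen))
        for chosen in combinations(range(n1), n1 - k)
    }


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {13}) | {14, 50}  # hypothetical set with |r Δ s| = 3

sig_s = enum_signatures(s, k=4, n1=6)
sig_r = enum_signatures(r, k=4, n1=6)

print(len(sig_s))                             # 15
print(bool(sig_s & sig_r))                    # True
print([comb(n1, 4) for n1 in (5, 6, 7, 20)])  # [5, 15, 35, 4845]
```

With n1 = 5 this reduces to one signature per partition; larger n1 makes signatures more selective at the cost of generating more of them.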

Algorithm
Generate signatures for each r_i, s_j
Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ Φ
Output those satisfying |r_i Δ s_j| ≤ 4
Only the signature function changes

Enumeration: Performance
[Plot: Prob. of Common Signature (y-axis, 0 to 1) against Symmetric Difference (x-axis, 1 to 20), k = 4; curves for n1 = 5 and n1 = 6]

False Positive Candidate Pair
[Figure: r and s over partitions 1 to 6, |r Δ s| = 5]

Enumeration: Performance
[Plot: Prob. of Common Signature (y-axis, 0 to 1) against Symmetric Difference (x-axis, 1 to 20), k = 4; curves for n1 = 5, 6, 7 and 20, annotated with 5, 15, 35 and 4845 signatures per set]

PartEnum: Divide and Conquer
[Figure: s1 split into two partitions, 1 and 2]
k = 4
k1 = 2, k2 = 1
Generate signatures using Enumeration
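The thresholds on this slide fit a pigeonhole argument: if r and s differ in at most 4 elements, they cannot differ in more than 2 elements of partition 1 and also more than 1 element of partition 2, since that would need at least 3 + 2 = 5 differences. The sketch below is one reading of the slide, not the authors' exact construction: it splits the domain 1..50 into two halves and runs Enumeration inside each half with that half's threshold. The bounds, the second-level partition count `n2` and all names are assumptions.

```python
from itertools import combinations


def enum_signatures(elems, lo, hi, k, n2):
    """Enumeration over the sub-domain [lo, hi): all subsets of n2 - k of n2 equal-width partitions."""
    width = hi - lo
    parts = [[] for _ in range(n2)]
    for e in sorted(elems):
        parts[(e - lo) * n2 // width].append(e)
    return {
        (chosen, tuple(tuple(parts[p]) for p in chosen))
        for chosen in combinations(range(n2), n2 - k)
    }


def partenum_signatures(s, bounds, thresholds, n2):
    """Union of per-partition Enumeration signatures, tagged with the first-level partition index."""
    sigs = set()
    for i, ((lo, hi), k_i) in enumerate(zip(bounds, thresholds)):
        local = [e for e in s if lo <= e < hi]
        sigs |= {(i,) + sig for sig in enum_signatures(local, lo, hi, k_i, n2)}
    return sigs


bounds = [(1, 26), (26, 51)]   # assumed first-level split of the domain 1..50
thresholds = (2, 1)            # k1 = 2, k2 = 1 as on the slide

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {13}) | {14, 50}      # hypothetical set with |r Δ s| = 3

sig_s = partenum_signatures(s, bounds, thresholds, n2=3)
sig_r = partenum_signatures(r, bounds, thresholds, n2=3)

print(len(sig_s))           # 6
print(bool(sig_s & sig_r))  # True
```

With these settings each set gets 3 + 3 = 6 signatures, compared with 15 for single-level Enumeration over 6 partitions.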

PartEnum: Asymptotic Performance
Theorem: There is an instance of PartEnum such that:
  If |r Δ s| > 7.5 k, then r and s do not share a signature with probability 1 – o(1)
  The number of signatures per set: O(k^2)

PartEnum: Summary
Set-Similarity Joins with predicate |r Δ s| ≤ k
Theoretical guarantees
First exact algorithm

Other results
PartEnum extensions:
  Larger class of set-similarity join predicates
    Jaccard
  Basic idea: reduce to symmetric set difference
WtEnum class of signature functions:
  Use frequency of elements
  Weighted set-similarity joins
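One way to see why a Jaccard predicate can be reduced to symmetric difference: with T = |r| + |s| and Δ = |r Δ s|, the intersection has (T − Δ)/2 elements and the union (T + Δ)/2, so Jaccard(r, s) = (T − Δ)/(T + Δ). Hence Jaccard(r, s) ≥ γ exactly when Δ ≤ T(1 − γ)/(1 + γ). The sketch below checks this equivalence; it illustrates the algebra only and is not claimed to be the reduction used in the talk. The function names are made up, and γ is passed as an exact `Fraction` to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction
from math import floor


def jaccard(r, s):
    """Exact Jaccard similarity |r ∩ s| / |r ∪ s| as a Fraction."""
    return Fraction(len(r & s), len(r | s))


def symdiff_bound(size_r, size_s, gamma):
    """Largest |r Δ s| compatible with Jaccard(r, s) >= gamma for the given set sizes."""
    return floor((1 - gamma) / (1 + gamma) * (size_r + size_s))


gamma = Fraction(4, 5)
r = {1, 2, 3, 4, 5, 6, 7, 8, 9}
s = {1, 2, 3, 4, 5, 6, 7, 8, 10}

print(jaccard(r, s) >= gamma)                # True (8/10)
print(len(r ^ s))                            # 2
print(symdiff_bound(len(r), len(s), gamma))  # 2
```

The bound depends on the set sizes, so a single global k is not enough; any real reduction has to account for sets of different sizes.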

Outline
Introduction
Algorithms
Experiments
Conclusion

Implementation
[Figure: the query plan R (Id, Elem), S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id, with stages annotated DBMS, Client + DBMS, DBMS, Client]

Previous Work
Prefix Filtering [CGK ’06]
  Exact
Locality Sensitive Hashing [IM ’98]
  Approximate
  False negative rate: 5%

Data Sets
Organization addresses [MS Sales]
  Concatenation: Org name, street, city, zip
  Input size: 1 million
  Avg. length: 11 words, 58 chars
  Tokenization: Words, n-grams

Jaccard, 1M, MS Sales
[Chart: running time in Seconds (0 to 4000) for PEN, LSH and PF at thresholds 0.8, 0.85 and 0.9; each bar is broken down into SigGen, CandPair and PostFilter]

Evaluation
[Figure: the query plan R (Id, Elem), S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id, with stages annotated DBMS, DBMS, Client + DBMS, Client; the "Intermediate Result size" is marked on the plan]

Jaccard, 1M, MS Sales
[Chart: Intermediate Result Size (0 to 2.5E+08) for PEN, LSH and PF at thresholds 0.8, 0.85 and 0.9]

Jaccard, Synthetic
[Plot: Intermediate Result Size (1.0E+03 to 1.0E+11) against Input Size (1.0E+03 to 1.0E+09), log scales; series LSH(0.95), PEN, PF]

Similar Results for …
Other data sets
  DBLP, Synthetic data sets
Other similarity functions
  Weighted Jaccard
  Edit distance

Conclusion
New algorithms for set-similarity joins
  Exact
  Performance guarantees
  Outperform previous exact algorithms