Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
SOFTJOINSWITHTFIDF:WHYANDWHAT
1
Sim Joins on Product Descriptions
• Surface similarity can be high for descriptions of distinct items:
o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...
o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop ..
• Surface similarity can be low for descriptions of identical items:
o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ...
o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.
2
Motivation• Integratingdataisimportant• Datafromdifferentsourcesmaynothaveconsistentobjectidentifiers– Especiallyautomatically-constructedones
• Butdatabaseswillhavehuman-readablenamesfortheobjects
• Butnamesaretricky….
3
4
Onesolution:Soft(Similarity)joins• AsimilarityjoinoftwosetsAandBis
– anorderedlistoftriples(sij,ai,bj)suchthat• ai isfromA• bj isfromB• sij isthesimilarity ofai andbj• thetriplesareindescendingorder
• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….
5
Example:softjoins/similarityjoins
… …
Input: Two Different Lists of Entity Names
6
TFIDFsimilarityworkswellfornames
… …
• Low weight for mismatches on common abbreviations (NPRES) • High weight for mismatches on unusual terms (Appomattox,
Aniakchak, …)
7
Example:softjoins/similarityjoinsOutput: Pairs of Names Ranked by Similarity
…
…
identical
similar
less similar
8
HowwelldoesTFIDFwork?
9
10
There are refinements to TFIDF distance – eg ones that extend with soft matching at the token level (e.g., softTFIDF)
11
SOFTJOINSWITHTFIDF:HOW?
12
TFIDFsimilarity
DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d
IDF(w) = |D |DF(w)
u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)
v(d) = u(d)||u(d) ||2
sim(v(d1),v(d2 )) = v(d1) ⋅v(d2 ) = u(w,d1)||u(d1) ||2w
∑ u(w,d2 )||u(d2 ) ||2
13
SoftTFIDFjoins• AsimilarityjoinoftwosetsofTFIDF-weightedvectorsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat
• ai isfromA• bj isfromB• sij isthedotproductofai andbj• thetriplesareindescendingorder
• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….
• donotcomparesimilarityofallpairs!
14
PARALLELSOFTJOINS
15
SIGMOD 2010
Adapted to Guinea Pig….
16
ParallelInvertedIndexSoftjoin - 1
want this to work for long documents or short ones…and keep the relations simple
Statistics for computing TFIDF with IDFs local to each relation
sumSquareWeights
17
ParallelInvertedIndexSoftjoin - 2
What’sthealgorithm?• Step1:createdocumentvectorsas(Cd,d,term,weight)tuples
• Step2:jointhetuplesfromAandB:onesortandreduce• Givesyoutuples(a,b,term,w(a,term)*w(b,term))
• Step3:groupthecommontermsby(a,b)andreducetoaggregatethecomponentsofthesum
This join can be very big for
common terms
18
Makingthealgorithmsmarter….
19
InvertedIndexSoftjoin - 2
we should make a smart choice about which terms to use
20
Addingheuristicstothesoftjoin- 1
21
Addingheuristicstothesoftjoin- 2
22
Addingheuristics• Parks:
– input40k–data60k–docvec 102k– softjoin
• 539ktokens• 508kdocuments• 0errorsintop50
• w/heuristics:– input40k–data60k–docvec 102k– softjoin
• 32ktokens• 24kdocuments• 3errorsintop50• <400usefulterms
23
Addingheuristics• SOvsWikipedia:
– input612M– docvec 1050M– softjoin 3.7G– jointime:54.30m
• withheuristics– maxDF=50
– softjoin 2.5G– jointime27:30m
• withheuristics– maxDF=20
– softjoin 1.2G– jointime11:50m
– includes70%ofthepairswithscore>=0.95Single-threaded Java
baseline: 2-3 days
24