24
SOFT JOINS WITH TFIDF: WHY AND WHAT 1

SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

SOFTJOINSWITHTFIDF:WHYANDWHAT

1

Page 2: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Sim Joins on Product Descriptions

• Surface similarity can be high for descriptions of distinct items:

o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...

o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop ..

• Surface similarity can be low for descriptions of identical items:

o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ...

o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.

2

Page 3: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Motivation• Integratingdataisimportant• Datafromdifferentsourcesmaynothaveconsistentobjectidentifiers– Especiallyautomatically-constructedones

• Butdatabaseswillhavehuman-readablenamesfortheobjects

• Butnamesaretricky….

3

Page 4: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

4

Page 5: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Onesolution:Soft(Similarity)joins• AsimilarityjoinoftwosetsAandBis

– anorderedlistoftriples(sij,ai,bj)suchthat• ai isfromA• bj isfromB• sij isthesimilarity ofai andbj• thetriplesareindescendingorder

• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….

5

Page 6: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Example:softjoins/similarityjoins

… …

Input: Two Different Lists of Entity Names

6

Page 7: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

TFIDFsimilarityworkswellfornames

… …

• Low weight for mismatches on common abbreviations (NPRES) • High weight for mismatches on unusual terms (Appomattox,

Aniakchak, …)

7

Page 8: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Example:softjoins/similarityjoinsOutput: Pairs of Names Ranked by Similarity

identical

similar

less similar

8

Page 9: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

HowwelldoesTFIDFwork?

9

Page 10: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

10

Page 11: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

There are refinements to TFIDF distance – eg ones that extend with soft matching at the token level (e.g., softTFIDF)

11

Page 12: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

SOFTJOINSWITHTFIDF:HOW?

12

Page 13: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

TFIDFsimilarity

DF(w) = # different docs w occurs in TF(w,d) = # different times w occurs in doc d

IDF(w) = |D |DF(w)

u(w,d) = log(TF(w,d)+1) ⋅ log(IDF(w))u(d) = u(w1,d),....,u(w|V |,d)

v(d) = u(d)||u(d) ||2

sim(v(d1),v(d2 )) = v(d1) ⋅v(d2 ) = u(w,d1)||u(d1) ||2w

∑ u(w,d2 )||u(d2 ) ||2

13

Page 14: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

SoftTFIDFjoins• AsimilarityjoinoftwosetsofTFIDF-weightedvectorsAandBis– anorderedlistoftriples(sij,ai,bj)suchthat

• ai isfromA• bj isfromB• sij isthedotproductofai andbj• thetriplesareindescendingorder

• thelistiseitherthetopKtriplesbysij orALLtripleswithsij>L…orsometimessomeapproximationofthese….

• donotcomparesimilarityofallpairs!

14

Page 15: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

PARALLELSOFTJOINS

15

Page 16: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

SIGMOD 2010

Adapted to Guinea Pig….

16

Page 17: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

ParallelInvertedIndexSoftjoin - 1

want this to work for long documents or short ones…and keep the relations simple

Statistics for computing TFIDF with IDFs local to each relation

sumSquareWeights

17

Page 18: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

ParallelInvertedIndexSoftjoin - 2

What’sthealgorithm?• Step1:createdocumentvectorsas(Cd,d,term,weight)tuples

• Step2:jointhetuplesfromAandB:onesortandreduce• Givesyoutuples(a,b,term,w(a,term)*w(b,term))

• Step3:groupthecommontermsby(a,b)andreducetoaggregatethecomponentsofthesum

This join can be very big for

common terms

18

Page 19: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Makingthealgorithmsmarter….

19

Page 20: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

InvertedIndexSoftjoin - 2

we should make a smart choice about which terms to use

20

Page 21: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Addingheuristicstothesoftjoin- 1

21

Page 22: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Addingheuristicstothesoftjoin- 2

22

Page 23: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Addingheuristics• Parks:

– input40k–data60k–docvec 102k– softjoin

• 539ktokens• 508kdocuments• 0errorsintop50

• w/heuristics:– input40k–data60k–docvec 102k– softjoin

• 32ktokens• 24kdocuments• 3errorsintop50• <400usefulterms

23

Page 24: SOFT JOINS WITH TFIDF: WHY AND WHATwcohen/10-605/simjoins-catchup.pdf · Soft TFIDF joins •A similarity join of two sets of TFIDF-weighted vectors A and B is –an ordered list

Addingheuristics• SOvsWikipedia:

– input612M– docvec 1050M– softjoin 3.7G– jointime:54.30m

• withheuristics– maxDF=50

– softjoin 2.5G– jointime27:30m

• withheuristics– maxDF=20

– softjoin 1.2G– jointime11:50m

– includes70%ofthepairswithscore>=0.95Single-threaded Java

baseline: 2-3 days

24