Adaptive Near-Duplicate Detection via Similarity Learning
Scott Wen-tau Yih (Microsoft Research)
Joint work with Hannaneh Hajishirzi (University of Illinois) and Aleksander Kolcz (Microsoft Bing)


Page 1: Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive Near-Duplicate Detection via Similarity Learning

Scott Wen-tau Yih (Microsoft Research)
Joint work with Hannaneh Hajishirzi (University of Illinois) and Aleksander Kolcz (Microsoft Bing)

Page 2: Adaptive Near-Duplicate Detection via Similarity Learning

Same article

Page 3: Adaptive Near-Duplicate Detection via Similarity Learning

Subject: The most popular 400% on first deposit
Dear Player : )

They offer a multi-levelled bonus, which if completed earns you a total o= 2400.

take your 400% right now on your first deposit
Get Started right now >>> http://docs.google.com/View?id=df67bssq_0cfwjq=x4

__________________________
Windows Live?: Keep your life in sync. http://windowslive.com/explore?ocid=TXT_TAGLM_WL_t1_allup_explore_012009

Subject: sweet dream 400% on first deposit
Dear Player: )

bets in light of the new legislation passed threatening the entire online g=ming ...

take your 400% right now on your first deposit
Get Started right now >>> http://docs.google.com/View?id=dfbgtp2q_0xh9sp=7h

_________________________________________________________________
News, entertainment and everything you care about at Live.com. Get it now=http://www.live.com/getstarted.aspx=Nothing can be better than buying a good with a discount.

Same payload info

Page 4: Adaptive Near-Duplicate Detection via Similarity Learning

Applications of Near-duplicate Detection

Search engines: smaller index and storage of crawled pages; present non-redundant information.
Email spam filtering: spam campaign detection.
Web plagiarism detection.
Online advertising: not showing content ads on low-quality pages.

Page 5: Adaptive Near-Duplicate Detection via Similarity Learning

Traditional Approaches

Efficient document similarity computation: encode each document into fixed-size hash code(s); documents with identical hash code(s) are treated as duplicates.
Very fast, with little document processing required.
However, it is difficult to fine-tune such algorithms to achieve high accuracy across different domains, e.g., "news pages" vs. "spam email".

Page 6: Adaptive Near-Duplicate Detection via Similarity Learning

Challenges of Improving NDD Accuracy

Capture the notion of "near-duplicate": whether a document fragment is important depends on the target application.
Generalize well to future data, e.g., identify important names even if they were unseen before.
Preserve efficiency: most applications target large document sets and cannot sacrifice efficiency for accuracy.

Page 7: Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive Near-duplicate Detection

Improves accuracy by learning a better document representation: the notion of "near-duplicate" is learned from a small number of labeled documents.
Has a simple feature design: alleviates the out-of-vocabulary problem, generalizes well, and is easy to evaluate with little additional computation.
Plugs in a learning component: can be easily combined with existing NDD methods.

Page 8: Adaptive Near-Duplicate Detection via Similarity Learning

Outline

Introduction
Adaptive Near-duplicate Detection
  A unified view of NDD methods
  Improving accuracy via similarity learning
Experiments
Conclusions

Page 9: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

1. Term vector construction (d → dvec)
2. Signature generation (dvec → dsig)
3. Document comparison

[Figure: pipeline for two documents d and d'; each is turned into a term vector (dvec, dvec') and then hashed into signatures (dsig, dsig'); documents are compared either by the vector similarity f_sim(dvec, dvec') or by the signature overlap Overlap(dsig, dsig').]

Page 10: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

Term vector construction (d → dvec): select n-grams from the raw document.
  Shingles: all n-grams
  I-Match: n-grams with mid-range idf values
  SpotSigs: skip n-grams after stop words [Theobald et al. '08]
Create the n-gram vector with binary or TFIDF weighting.

Example (n = 1): "BP to proceed with pressure test on leaking well ..." yields a binary vector over the unigrams "proceed", "pressure", "leaking", etc.
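The term-vector construction step can be sketched as follows; this is a minimal illustration assuming whitespace tokenization and binary weighting (the `term_vector` name is hypothetical, not from the slides):

```python
def term_vector(text, n=1):
    # Tokenize and slide an n-gram window over the document.
    tokens = text.lower().split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Binary weighting: represent the vector as the set of occurring n-grams.
    return set(ngrams)

vec = term_vector("BP to proceed with pressure test on leaking well", n=2)
```

TF or TFIDF weighting would replace the set with an n-gram-to-weight mapping.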

Page 11: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

Signature generation (dvec → dsig): for efficient document comparison and processing, encode the document into a set of hash code(s).
  Shingles: MinHash
  I-Match: SHA1 (single hash value)
  Charikar's random projection: SimHash [Henzinger '06]
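A minimal MinHash sketch of the signature-generation step, assuming salted SHA1 as the family of hash functions (real implementations use many more components than the default here):

```python
import hashlib

def minhash_signature(ngrams, num_hashes=4):
    # One component per salted hash function: the minimum hash value of
    # any n-gram in the set. Two sets agree on a given component with
    # probability equal to their Jaccard similarity.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.sha1(f"{seed}:{g}".encode()).hexdigest(), 16)
                       for g in ngrams))
    return sig
```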

Page 12: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

Document comparison: two documents are near-duplicates if their similarity exceeds a threshold. The signature generation scheme depends on the target vector similarity function: MinHash approximates Jaccard; SimHash approximates cosine.
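The signature-comparison step for MinHash can be sketched as the fraction of matching signature components, which is an unbiased estimate of the Jaccard similarity of the underlying n-gram sets (the function name is illustrative):

```python
def signature_overlap(sig1, sig2):
    # Fraction of signature components on which the two documents agree;
    # for MinHash signatures this estimates Jaccard(dvec, dvec').
    assert len(sig1) == len(sig2)
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

Declaring the pair near-duplicate then amounts to thresholding this estimate.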

Page 13: Adaptive Near-Duplicate Detection via Similarity Learning

Key to Improving NDD Accuracy

The quality of the term vectors determines the final prediction accuracy; the hashing schemes merely approximate the vector similarity function (e.g., cosine or Jaccard).
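For the cosine case, Charikar's random-projection scheme (SimHash) can be sketched as follows; SHA1 stands in for the random hyperplanes, an assumption for compactness rather than the scheme's usual construction:

```python
import hashlib

def simhash(weighted_terms, bits=64):
    # Each term votes +w or -w on every bit position according to its
    # hash; the sign of the accumulated vote gives the signature bit.
    # Hamming similarity of two signatures approximates the cosine
    # similarity of the underlying weighted term vectors.
    acc = [0.0] * bits
    for term, weight in weighted_terms.items():
        h = int(hashlib.sha1(term.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if acc[i] > 0)
```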

Page 14: Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive NDD: The Learning Component

Create term vectors with learned term-weighting scores. The scores are determined by n-gram properties (TF, DF, position, IsCapital, AnchorText, etc.); they indicate the importance of document fragments and are learned using side information.

Page 15: Adaptive Near-Duplicate Detection via Similarity Learning

Term Vector → Document Similarity

The weight of term t_i in document d is a linear function of its feature values φ_1(t_i, d), ..., φ_m(t_i, d), with model parameters λ that are learned. The resulting vectors dvec and dvec' are compared with f_sim(dvec, dvec').
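A sketch of this learned representation, under the assumption (consistent with the slide's symbols) that the term weight is the linear combination Σ_j λ_j φ_j(t_i, d) and that f_sim is cosine over the re-weighted vectors; all function names are illustrative:

```python
def term_weight(features, lam):
    # Weight of term t_i in d: a linear combination of its feature
    # values phi_1(t_i, d) ... phi_m(t_i, d) with learned parameters lam.
    return sum(l * f for l, f in zip(lam, features))

def weighted_cosine(doc1, doc2, lam):
    # doc1/doc2 map each term to its feature vector (TF, DF, position, ...).
    w1 = {t: term_weight(f, lam) for t, f in doc1.items()}
    w2 = {t: term_weight(f, lam) for t, f in doc2.items()}
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    n1 = sum(v * v for v in w1.values()) ** 0.5
    n2 = sum(v * v for v in w2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0
```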

Page 16: Adaptive Near-Duplicate Detection via Similarity Learning

Features

Doc-independent features: evaluated by table lookup, e.g., document frequency (DF), query frequency (QF).
Doc-dependent features: evaluated by a linear scan, e.g., term frequency (TF), term location (Loc).
No lexical features are used; all features are very easy to compute.

Page 17: Adaptive Near-Duplicate Detection via Similarity Learning

Training Procedure

Training data: labeled document pairs with binary near-duplicate labels.
Possible loss functions: sum-squared error between the predicted similarity and the label; log-loss; pairwise loss.
Training can be done using gradient-based methods, such as L-BFGS.
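As a self-contained illustration of the gradient-based training step (the slide uses L-BFGS; plain gradient descent with finite-difference gradients is substituted here for brevity, and the loss below is a toy stand-in for the sum-squared error over labeled pairs):

```python
def train(loss, lam, lr=0.1, steps=200, eps=1e-6):
    # Minimize loss(lam) by gradient descent; gradients are approximated
    # by forward finite differences, standing in for L-BFGS.
    lam = list(lam)
    for _ in range(steps):
        base = loss(lam)
        grad = []
        for j in range(len(lam)):
            bumped = lam[:]
            bumped[j] += eps
            grad.append((loss(bumped) - base) / eps)
        lam = [l - lr * g for l, g in zip(lam, grad)]
    return lam

# Toy quadratic loss with known optimum lam* = [2.0, -1.0].
lam = train(lambda l: (l[0] - 2.0) ** 2 + (l[1] + 1.0) ** 2, [0.0, 0.0])
```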

Page 18: Adaptive Near-Duplicate Detection via Similarity Learning

Outline

Introduction
Adaptive Near-duplicate Detection
Experiments
  Data sets: News & Email
  Quality of raw vector representations
  Quality of document signatures
  Learning curve
Conclusions

Page 19: Adaptive Near-Duplicate Detection via Similarity Learning

Data Sets

Web News Articles (News): near-duplicate news pages [Theobald et al. SIGIR-08]; 68 clusters, 2,160 news articles in total; 5×2-fold cross-validation.
Hotmail Outbound Messages (Email): training: 400 clusters (2,256 messages) from Dec 2008; testing: 475 clusters (658 messages) from Jan 2009. Initial clusters were selected using Shingles and I-Match; labels were further corrected manually.

Page 20: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Raw Vector Representation (News Dataset)

[Bar chart: max F1 score for ANDD-Raw, TFIDF, Binary, and SpotSigs unigram (n = 1) representations under cosine and Jaccard similarity; extracted bar values include 0.956, 0.952, 0.884, 0.861, 0.953, and 0.952.]

Page 21: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Raw Vector Representation (Email Dataset)

[Bar chart: max F1 score for ANDD-Raw, TFIDF, Binary, and SpotSigs unigram (n = 1) representations under cosine and Jaccard similarity; extracted bar values: 0.674, 0.700, 0.625, 0.626, 0.622, 0.622, 0.257, 0.258.]

Page 22: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Document Signature (News Dataset)

[Bar chart: max F1 score per signature scheme; extracted values in legend order: ANDD-SimHash (48 bytes) 0.943, ANDD-MinHash (200 bytes) 0.931, Charikar (48 bytes) 0.82, Shingles (672 bytes) 0.876.]

Page 23: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Document Signature (Email Dataset)

[Bar chart: max F1 score per signature scheme; extracted values in legend order: ANDD-SimHash (48 bytes) 0.656, ANDD-MinHash (200 bytes) 0.646, Charikar (48 bytes) 0.591, Shingles (672 bytes) 0.602, 1-Shingles (48 bytes) 0.548, I-Match (24 bytes) 0.532.]

Page 24: Adaptive Near-Duplicate Detection via Similarity Learning

Learning Curve (News Dataset)

[Line chart: max F1 score vs. number of training documents (0 to 300), comparing the initial model and the final model.]

Page 25: Adaptive Near-Duplicate Detection via Similarity Learning

Conclusions

A novel NDD method that is robust to domain change: it learns a better raw n-gram vector representation and provides a more accurate document similarity measure.
Improves accuracy without sacrificing efficiency: simple features and good-quality document signatures.
Requires only a small number of training examples.

Future work: include more information from document analysis; improve the similarity function using metric learning; learn the signature generation process.