Adaptive Near-Duplicate Detection via Similarity Learning
Scott Wen-tau Yih (Microsoft Research)
Joint work with Hannaneh Hajishirzi (University of Illinois) and Aleksander Kolcz (Microsoft Bing)


Page 1: Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive Near-Duplicate Detection via Similarity Learning

Scott Wen-tau Yih (Microsoft Research)
Joint work with Hannaneh Hajishirzi (University of Illinois) and Aleksander Kolcz (Microsoft Bing)

Page 2: Adaptive Near-Duplicate Detection via Similarity Learning

Same article

Page 3: Adaptive Near-Duplicate Detection via Similarity Learning

Subject: The most popular 400% on first deposit
Dear Player : )

They offer a multi-levelled bonus, which if completed earns you a total o= 2400.

take your 400% right now on your first deposit
Get Started right now >>> http://docs.google.com/View?id=df67bssq_0cfwjq=x4

__________________________
Windows Live?: Keep your life in sync. http://windowslive.com/explore?ocid=TXT_TAGLM_WL_t1_allup_explore_012009

Subject: sweet dream 400% on first deposit
Dear Player: )

bets in light of the new legislation passed threatening the entire online g=ming ...

take your 400% right now on your first deposit
Get Started right now >>> http://docs.google.com/View?id=dfbgtp2q_0xh9sp=7h

_________________________________________________________________
News, entertainment and everything you care about at Live.com. Get it now=http://www.live.com/getstarted.aspx=Nothing can be better than buying a good with a discount.

Same payload info

Page 4: Adaptive Near-Duplicate Detection via Similarity Learning

Applications of Near-duplicate Detection

Search engines: smaller index and storage of crawled pages; present non-redundant information.
Email spam filtering: spam campaign detection.
Web plagiarism detection.
Online advertising: not showing content ads on low-quality pages.

Page 5: Adaptive Near-Duplicate Detection via Similarity Learning

Traditional Approaches

Efficient document similarity computation: encode each document into fixed-size hash code(s); documents with identical hash code(s) are treated as duplicates.
Very fast, with little document processing required.
However, it is difficult to fine-tune such algorithms to achieve high accuracy across different domains, e.g., "news pages" vs. "spam email".

Page 6: Adaptive Near-Duplicate Detection via Similarity Learning

Challenges of Improving NDD Accuracy

Capture the notion of "near-duplicate": whether a document fragment is important depends on the target application.
Generalize well to future data, e.g., identify important names even if they were unseen before.
Preserve efficiency: most applications target large document sets and cannot sacrifice efficiency for accuracy.

Page 7: Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive Near-duplicate Detection

Improves accuracy by learning a better document representation: the notion of "near-duplicate" is learned from a small number of labeled documents.
Has a simple feature design: alleviates the out-of-vocabulary problem, generalizes well, and is easy to evaluate with little additional computation.
Plugs in a learning component: can be easily combined with existing NDD methods.

Page 8: Adaptive Near-Duplicate Detection via Similarity Learning

Outline

Introduction
Adaptive Near-duplicate Detection
  A unified view of NDD methods
  Improving accuracy via similarity learning
Experiments
Conclusions

Page 9: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

1. Term vector construction (d → dvec)
2. Signature generation (dvec → dsig)
3. Document comparison

[Figure: pipeline for two documents d and d'; each is turned into a term vector (dvec, dvec') and then hashed into signatures (dsig, dsig'); documents are compared either by the vector similarity f_sim(dvec, dvec') or by the signature overlap Overlap(dsig, dsig').]

Page 10: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

Term vector construction (d → dvec): select n-grams from the raw document.
  Shingles: all n-grams
  I-Match: n-grams with mid-range idf values
  SpotSigs: skip n-grams after stop words [Theobald et al. '08]
Create the n-gram vector with binary or TFIDF weighting.

Example (n = 1): "BP to proceed with pressure test on leaking well ..." yields a binary vector over the unigrams "proceed", "pressure", "leaking", etc.
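The term-vector construction step can be sketched as follows; this is a minimal illustration assuming whitespace tokenization and binary weighting (the `term_vector` name is hypothetical, not from the slides):

```python
def term_vector(text, n=1):
    # Tokenize and slide an n-gram window over the document.
    tokens = text.lower().split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Binary weighting: represent the vector as the set of occurring n-grams.
    return set(ngrams)

vec = term_vector("BP to proceed with pressure test on leaking well", n=2)
```

TF or TFIDF weighting would replace the set with an n-gram-to-weight mapping.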

Page 11: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

Signature generation (dvec → dsig): for efficient document comparison and processing, encode the document into a set of hash code(s).
  Shingles: MinHash
  I-Match: SHA1 (single hash value)
  Charikar's random projection: SimHash [Henzinger '06]
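A minimal MinHash sketch of the signature-generation step, assuming salted SHA1 as the family of hash functions (real implementations use many more components than the default here):

```python
import hashlib

def minhash_signature(ngrams, num_hashes=4):
    # One component per salted hash function: the minimum hash value of
    # any n-gram in the set. Two sets agree on a given component with
    # probability equal to their Jaccard similarity.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.sha1(f"{seed}:{g}".encode()).hexdigest(), 16)
                       for g in ngrams))
    return sig
```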

Page 12: Adaptive Near-Duplicate Detection via Similarity Learning

A Unified View of NDD Methods

Document comparison: two documents are near-duplicates if their similarity exceeds a threshold. The signature generation scheme depends on the target vector similarity function: MinHash approximates Jaccard; SimHash approximates cosine.
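The signature-comparison step for MinHash can be sketched as the fraction of matching signature components, which is an unbiased estimate of the Jaccard similarity of the underlying n-gram sets (the function name is illustrative):

```python
def signature_overlap(sig1, sig2):
    # Fraction of signature components on which the two documents agree;
    # for MinHash signatures this estimates Jaccard(dvec, dvec').
    assert len(sig1) == len(sig2)
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

Declaring the pair near-duplicate then amounts to thresholding this estimate.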

Page 13: Adaptive Near-Duplicate Detection via Similarity Learning

Key to Improving NDD Accuracy

The quality of the term vectors determines the final prediction accuracy; the hashing schemes merely approximate the vector similarity function (e.g., cosine or Jaccard).
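For the cosine case, Charikar's random-projection scheme (SimHash) can be sketched as follows; SHA1 stands in for the random hyperplanes, an assumption for compactness rather than the scheme's usual construction:

```python
import hashlib

def simhash(weighted_terms, bits=64):
    # Each term votes +w or -w on every bit position according to its
    # hash; the sign of the accumulated vote gives the signature bit.
    # Hamming similarity of two signatures approximates the cosine
    # similarity of the underlying weighted term vectors.
    acc = [0.0] * bits
    for term, weight in weighted_terms.items():
        h = int(hashlib.sha1(term.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if acc[i] > 0)
```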

Page 14: Adaptive Near-Duplicate Detection via Similarity Learning

Adaptive NDD: The Learning Component

Create term vectors with learned term-weighting scores. The scores are determined by n-gram properties (TF, DF, position, IsCapital, AnchorText, etc.); they indicate the importance of document fragments and are learned using side information.

Page 15: Adaptive Near-Duplicate Detection via Similarity Learning

Term Vector → Document Similarity

The weight of term t_i in document d is a linear function of its feature values φ_1(t_i, d), ..., φ_m(t_i, d), with model parameters λ that are learned. The resulting vectors dvec and dvec' are compared with f_sim(dvec, dvec').
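A sketch of this learned representation, under the assumption (consistent with the slide's symbols) that the term weight is the linear combination Σ_j λ_j φ_j(t_i, d) and that f_sim is cosine over the re-weighted vectors; all function names are illustrative:

```python
def term_weight(features, lam):
    # Weight of term t_i in d: a linear combination of its feature
    # values phi_1(t_i, d) ... phi_m(t_i, d) with learned parameters lam.
    return sum(l * f for l, f in zip(lam, features))

def weighted_cosine(doc1, doc2, lam):
    # doc1/doc2 map each term to its feature vector (TF, DF, position, ...).
    w1 = {t: term_weight(f, lam) for t, f in doc1.items()}
    w2 = {t: term_weight(f, lam) for t, f in doc2.items()}
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    n1 = sum(v * v for v in w1.values()) ** 0.5
    n2 = sum(v * v for v in w2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0
```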

Page 16: Adaptive Near-Duplicate Detection via Similarity Learning

Features

Doc-independent features: evaluated by table lookup, e.g., document frequency (DF), query frequency (QF).
Doc-dependent features: evaluated by a linear scan, e.g., term frequency (TF), term location (Loc).
No lexical features are used; all features are very easy to compute.

Page 17: Adaptive Near-Duplicate Detection via Similarity Learning

Training Procedure

Training data: labeled document pairs with binary near-duplicate labels.
Possible loss functions: sum-squared error between the predicted similarity and the label; log-loss; pairwise loss.
Training can be done using gradient-based methods, such as L-BFGS.
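As a self-contained illustration of the gradient-based training step (the slide uses L-BFGS; plain gradient descent with finite-difference gradients is substituted here for brevity, and the loss below is a toy stand-in for the sum-squared error over labeled pairs):

```python
def train(loss, lam, lr=0.1, steps=200, eps=1e-6):
    # Minimize loss(lam) by gradient descent; gradients are approximated
    # by forward finite differences, standing in for L-BFGS.
    lam = list(lam)
    for _ in range(steps):
        base = loss(lam)
        grad = []
        for j in range(len(lam)):
            bumped = lam[:]
            bumped[j] += eps
            grad.append((loss(bumped) - base) / eps)
        lam = [l - lr * g for l, g in zip(lam, grad)]
    return lam

# Toy quadratic loss with known optimum lam* = [2.0, -1.0].
lam = train(lambda l: (l[0] - 2.0) ** 2 + (l[1] + 1.0) ** 2, [0.0, 0.0])
```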

Page 18: Adaptive Near-Duplicate Detection via Similarity Learning

Outline

Introduction
Adaptive Near-duplicate Detection
Experiments
  Data sets: News & Email
  Quality of raw vector representations
  Quality of document signatures
  Learning curve
Conclusions

Page 19: Adaptive Near-Duplicate Detection via Similarity Learning

Data Sets

Web News Articles (News): near-duplicate news pages [Theobald et al. SIGIR-08]; 68 clusters, 2,160 news articles in total; 5×2-fold cross-validation.
Hotmail Outbound Messages (Email): training: 400 clusters (2,256 messages) from Dec 2008; testing: 475 clusters (658 messages) from Jan 2009. Initial clusters were selected using Shingles and I-Match; labels were further corrected manually.

Page 20: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Raw Vector Representation (News Dataset)

[Bar chart: max F1 score for ANDD-Raw, TFIDF, Binary, and SpotSigs unigram (n = 1) representations under cosine and Jaccard similarity; extracted bar values include 0.956, 0.952, 0.884, 0.861, 0.953, and 0.952.]

Page 21: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Raw Vector Representation (Email Dataset)

[Bar chart: max F1 score for ANDD-Raw, TFIDF, Binary, and SpotSigs unigram (n = 1) representations under cosine and Jaccard similarity; extracted bar values: 0.674, 0.700, 0.625, 0.626, 0.622, 0.622, 0.257, 0.258.]

Page 22: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Document Signature (News Dataset)

[Bar chart: max F1 score per signature scheme; extracted values in legend order: ANDD-SimHash (48 bytes) 0.943, ANDD-MinHash (200 bytes) 0.931, Charikar (48 bytes) 0.82, Shingles (672 bytes) 0.876.]

Page 23: Adaptive Near-Duplicate Detection via Similarity Learning

Quality of Document Signature (Email Dataset)

[Bar chart: max F1 score per signature scheme; extracted values in legend order: ANDD-SimHash (48 bytes) 0.656, ANDD-MinHash (200 bytes) 0.646, Charikar (48 bytes) 0.591, Shingles (672 bytes) 0.602, 1-Shingles (48 bytes) 0.548, I-Match (24 bytes) 0.532.]

Page 24: Adaptive Near-Duplicate Detection via Similarity Learning

Learning Curve (News Dataset)

[Line chart: max F1 score vs. number of training documents (0 to 300), comparing the initial model and the final model.]

Page 25: Adaptive Near-Duplicate Detection via Similarity Learning

Conclusions

A novel NDD method that is robust to domain change: it learns a better raw n-gram vector representation and provides a more accurate document similarity measure.
Improves accuracy without sacrificing efficiency: simple features and good-quality document signatures.
Requires only a small number of training examples.

Future work: include more information from document analysis; improve the similarity function using metric learning; learn the signature generation process.