Adaptive Near-Duplicate Detection via Similarity Learning
Scott Wen-tau Yih (Microsoft Research)
Joint work with Hannaneh Hajishirzi (University of Illinois) and Aleksander Kolcz (Microsoft Bing)
Same article

Subject: The most popular 400% on first deposit
Dear Player : )
They offer a multi-levelled bonus, which if completed earns you a total o= 2400.
take your 400% right now on your first deposit
Get Started right now >>> http://docs.google.com/View?id=df67bssq_0cfwjq=x4
__________________________
Windows Live?: Keep your life in sync. http://windowslive.com/explore?ocid=TXT_TAGLM_WL_t1_allup_explore_012009

Subject: sweet dream 400% on first deposit
Dear Player : )
bets in light of the new legislation passed threatening the entire online g=ming ...
take your 400% right now on your first deposit
Get Started right now >>> http://docs.google.com/View?id=dfbgtp2q_0xh9sp=7h
_________________________________________________________________
News, entertainment and everything you care about at Live.com. Get it now= http://www.live.com/getstarted.aspx= Nothing can be better than buying a good with a discount.

Same payload info
Applications of Near-duplicate Detection
Search Engines
- Smaller index and storage of crawled pages
- Present non-redundant information
Email spam filtering
- Spam campaign detection
Online Advertising
- Not showing content ads on low-quality pages
Web plagiarism detection
Traditional Approaches
Efficient document similarity computation
- Encode doc into fixed-size hash code(s)
- Docs with identical hash code(s) → duplicate
- Very fast; little document processing
Difficult to fine-tune the algorithm to achieve high accuracy across different domains
- e.g., "news pages" vs. "spam email"
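The "identical hash ⇒ duplicate" idea can be sketched with an I-Match-style signature: hash only the mid-idf terms of each document, so two spam variants that differ only in boilerplate and random tokens collide. The idf table, thresholds, and tokenization below are illustrative, not the published settings.

```python
import hashlib

def imatch_signature(doc_terms, idf, low=1.0, high=4.0):
    """I-Match-style signature (sketch): keep only terms whose idf falls in a
    mid range, so boilerplate (low idf) and random noise (high idf) drop out,
    then hash the surviving term set with SHA1."""
    kept = sorted(t for t in set(doc_terms) if low <= idf.get(t, 0.0) <= high)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

# Illustrative idf values; unknown terms default to 0.0 and are filtered out.
idf = {"deposit": 2.5, "bonus": 2.8, "player": 3.1, "the": 0.2, "df67bssq": 9.0}
a = imatch_signature("dear player take your bonus deposit df67bssq".split(), idf)
b = imatch_signature("dear player take your bonus deposit dfbgtp2q".split(), idf)
print(a == b)  # True: the two variants collide because only mid-idf terms are hashed
```

The random tracking tokens (high idf) and the greeting words (low or unknown idf) are filtered out, so both spam variants reduce to the same mid-idf term set and hence the same hash.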
Challenges of Improving NDD Accuracy
Capture the notion of "near-duplicate"
- Whether a document fragment is important depends on the target application
Generalize well for future data
- e.g., identify important names even if they were unseen before
Preserve efficiency
- Most applications target large document sets; cannot sacrifice efficiency for accuracy
Adaptive Near-duplicate Detection
Improves accuracy by learning a better document representation
- Learns the notion of "near-duplicate" from (a small number of) labeled documents
Has a simple feature design
- Alleviates the out-of-vocabulary problem; generalizes well
- Easy to evaluate; little additional computation
Plugs in a learning component
- Can be easily combined with existing NDD methods
Outline
- Introduction
- Adaptive Near-duplicate Detection
  - A unified view of NDD methods
  - Improving accuracy via similarity learning
- Experiments
- Conclusions
A Unified View of NDD Methods
1. Term vector construction (d → d_vec)
2. Signature generation (d_vec → d_sig)
3. Document comparison
[Diagram: documents d and d' are mapped to term vectors d_vec and d_vec' (e.g., 0 1 1 0 1), then to hash signatures d_sig and d_sig' (e.g., AB12FE01…), which are compared via f_sim / overlap.]
A Unified View of NDD Methods
Term vector construction (d → d_vec)
- Select n-grams from the raw document
  - Shingles: all n-grams
  - I-Match: unigrams with mid idf values
  - SpotSigs: skip n-grams after stop words [Theobald et al. '08]
- Create an n-gram vector with binary/TFIDF weighting
Example (n = 1): "BP to proceed with pressure test on leaking well …" → "proceed", "pressure", "leaking", …
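The construction step above can be sketched as follows; the whitespace tokenizer and default weights are simplifications, and SpotSigs-style stop-word anchoring is omitted for brevity.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list (the Shingles choice)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_vector(text, n=1, idf=None, binary=False):
    """n-gram term vector with binary or TF-IDF weighting (sketch).
    idf defaults to 1.0 for unseen grams, i.e., plain TF weighting."""
    tf = Counter(ngrams(text.lower().split(), n))
    if binary:
        return {g: 1.0 for g in tf}
    idf = idf or {}
    return {g: c * idf.get(g, 1.0) for g, c in tf.items()}

v = term_vector("BP to proceed with pressure test on leaking well", n=1, binary=True)
print(sorted(v)[:3])
```

Switching `n` reproduces the shingle variants: `term_vector(text, n=3)` yields the 3-gram vector the Shingles scheme would hash.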
A Unified View of NDD Methods
Signature generation (d_vec → d_sig)
- For efficient document comparison and processing
- Encode the document into a set of hash code(s)
  - Shingles: MinHash
  - I-Match: SHA1 (single hash value)
  - Charikar's random projection: SimHash [Henzinger '06]
[Diagram: term vector d_vec encoded into hash signatures d_sig, e.g., AB12FE01…]
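Both signature families named above fit in a few lines. The md5-based hash family here is an illustrative stand-in for the seeded hash functions a production system would use; signature sizes are toy values.

```python
import hashlib

def _h(x, seed):
    """Seeded integer hash (illustrative stand-in for a proper hash family)."""
    return int(hashlib.md5(f"{seed}:{x}".encode("utf-8")).hexdigest(), 16)

def minhash(terms, num_hashes=8):
    """MinHash signature: per hash function, keep the minimum hash over the
    term set. Per slot, the collision probability equals Jaccard similarity."""
    return [min(_h(t, s) for t in terms) for s in range(num_hashes)]

def simhash(weighted_terms, bits=32):
    """SimHash (Charikar-style random projection, sketch): per bit position,
    sum signed term weights and keep the sign as the signature bit."""
    acc = [0.0] * bits
    for term, w in weighted_terms.items():
        h = _h(term, 0)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if acc[i] > 0)

s1 = minhash({"proceed", "pressure", "leaking", "well"})
s2 = minhash({"proceed", "pressure", "leaking", "test"})
```

Identical term sets always yield identical signatures; highly overlapping sets agree on most slots in expectation.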
A Unified View of NDD Methods
Document comparison
- Documents are near-duplicate if their signature similarity exceeds a threshold
- Signature generation schemes depend on the target similarity function f_sim:
  - Jaccard → MinHash; Cosine → SimHash
[Diagram: d and d' mapped to term vectors d_vec, d_vec' and signatures d_sig, d_sig', compared via f_sim / overlap.]
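Once signatures exist, comparison reduces to counting agreements; a minimal sketch (the 0.9 threshold is illustrative):

```python
def jaccard_estimate(sig1, sig2):
    """Estimate Jaccard similarity as the fraction of matching MinHash slots."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

def hamming_similarity(h1, h2, bits=32):
    """SimHash proxy for cosine similarity: 1 minus normalized Hamming distance
    between the two bit signatures."""
    return 1 - bin(h1 ^ h2).count("1") / bits

def near_duplicate(sim, threshold=0.9):
    """Flag a pair as near-duplicate when similarity clears a threshold."""
    return sim >= threshold

print(jaccard_estimate([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
```

Which estimator applies follows from the signature scheme: MinHash slots estimate Jaccard, SimHash bit agreement tracks cosine.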
Key to Improving NDD Accuracy
The quality of the term vectors determines the final prediction accuracy
- Hashing schemes approximate the vector similarity function (e.g., cosine and Jaccard)
[Diagram: the same pipeline from documents d, d' to term vectors to signatures.]
Adaptive NDD: The Learning Component
Create term vectors with different term-weighting scores
- Scores are determined by n-gram properties: TF, DF, Position, IsCapital, AnchorText, etc.
- Scores indicate the importance of document fragments and are learned using side information

Term Vector → Document Similarity
- Weight of term t_i: a learned function of its features φ1(t_i, d), …, φm(t_i, d)
- Learn the model parameters so that f_sim(d_vec, d_vec') reflects the labels
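A minimal sketch of a learned term weight: a linear combination of the term's features φ1..φm squashed through a sigmoid so weights fall in [0, 1]. The feature names, parameter values, and sigmoid link are illustrative assumptions, not the paper's exact model.

```python
import math

def term_weight(features, lam):
    """Learned term weight (sketch): sigmoid of a linear model over the term's
    feature values phi_1..phi_m. lam holds one learned parameter per feature."""
    z = sum(lam[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative parameters: high DF hurts, capitalization (names) helps.
lam = {"tf": 0.5, "df": -0.8, "is_capital": 1.2}
w_common = term_weight({"tf": 2, "df": 5.0, "is_capital": 0}, lam)
w_name = term_weight({"tf": 1, "df": 0.5, "is_capital": 1}, lam)
print(w_name > w_common)  # True: a rare capitalized term outweighs a common one
```

Because the features are simple statistics rather than lexical identities, an unseen name still gets a sensible weight, which is how the scheme sidesteps the out-of-vocabulary problem.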
Features
Doc-independent features
- Evaluated by table lookup
- e.g., document frequency (DF), query frequency (QF)
Doc-dependent features
- Evaluated by a linear scan
- e.g., term frequency (TF), term location (Loc)
No lexical features used; very easy to compute
Training Procedure
Training data: labeled document pairs (near-duplicate or not)
Possible loss functions:
- Sum-squared error
- Log-loss, pairwise loss
Training can be done using gradient-based methods, such as L-BFGS
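The training loop can be sketched as follows. To keep the example self-contained it uses plain numerical-gradient descent rather than L-BFGS, and collapses the learned-weight similarity into a sigmoid over hand-made pair features; both are simplifying assumptions, not the paper's setup.

```python
import math

def predict(lam, x):
    """Predicted pair similarity: sigmoid of a linear model over pair features
    (a simplification of the full learned-term-weight similarity)."""
    z = sum(lam[k] * x.get(k, 0.0) for k in lam)
    return 1.0 / (1.0 + math.exp(-z))

def sse_loss(lam, data):
    """Sum-squared error between predicted similarity and the pair label."""
    return sum((predict(lam, x) - y) ** 2 for x, y in data)

def train(lam, data, lr=0.5, epochs=300, eps=1e-6):
    """Numerical-gradient descent on the SSE loss (the slides suggest any
    gradient-based optimizer, e.g., L-BFGS, in place of this loop)."""
    lam = dict(lam)
    for _ in range(epochs):
        for k in lam:
            g = (sse_loss({**lam, k: lam[k] + eps}, data) - sse_loss(lam, data)) / eps
            lam[k] -= lr * g
    return lam

# Toy data: one near-duplicate pair (label 1) and one distinct pair (label 0).
data = [({"overlap": 0.9, "bias": 1}, 1.0),
        ({"overlap": 0.2, "bias": 1}, 0.0)]
lam = train({"overlap": 0.0, "bias": 0.0}, data)
p_pos = predict(lam, {"overlap": 0.9, "bias": 1})
p_neg = predict(lam, {"overlap": 0.2, "bias": 1})
print(p_pos > p_neg)
```

Swapping `sse_loss` for log-loss or a pairwise ranking loss only changes the objective function; the optimization loop stays the same.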
Outline
- Introduction
- Adaptive Near-duplicate Detection
- Experiments
  - Data sets: News & Email
  - Quality of raw vector representations
  - Quality of document signatures
  - Learning curve
- Conclusions
Data Sets
Web News Articles (News)
- Near-duplicate news pages [Theobald et al. SIGIR-08]
- 68 clusters; 2,160 news articles in total
- 5×2-fold cross-validation
Hotmail Outbound Messages (Email)
- Training: 400 clusters (2,256 msgs) from Dec 2008
- Testing: 475 clusters (658 msgs) from Jan 2009
- Initial clusters selected using Shingles and I-Match; labels further corrected manually
Quality of Raw Vector Representation (News Dataset)
[Bar chart: max F1 score for ANDD-Raw, TFIDF, Binary, and SpotSigs term vectors (unigram, n = 1), under cosine and Jaccard similarity. ANDD-Raw scores highest at 0.956 (cosine) / 0.952 (Jaccard); the remaining reported values are 0.953, 0.952, 0.884, and 0.861.]
Quality of Raw Vector Representation (Email Dataset)
[Bar chart: max F1 score (cosine / Jaccard), reading the bars in legend order: ANDD-Raw 0.674 / 0.700; TFIDF 0.625 / 0.626; Binary 0.622 / 0.622; SpotSigs 0.257 / 0.258 (unigram, n = 1).]
Quality of Document Signature (News Dataset)
[Bar chart: max F1 score, reading the bars in legend order: ANDD-SimHash (48 bytes) 0.943; ANDD-MinHash (200 bytes) 0.931; Charikar (48 bytes) 0.82; Shingles (672 bytes) 0.876.]
Quality of Document Signature (Email Dataset)
[Bar chart: max F1 score, reading the bars in legend order: ANDD-SimHash (48 bytes) 0.656; ANDD-MinHash (200 bytes) 0.646; Charikar (48 bytes) 0.591; Shingles (672 bytes) 0.602; 1-Shingles (48 bytes) 0.548; I-Match (24 bytes) 0.532.]
Learning Curve (News Dataset)
[Line chart: max F1 score vs. number of training documents (0–300), rising from the initial model at roughly 0.80 toward the final model at roughly 0.95.]
Conclusions
A novel NDD method, robust to domain change
- Learns a better raw n-gram vector representation
- Provides more accurate document similarity measures
Improves accuracy without sacrificing efficiency
- Simple features; good-quality document signatures
Requires only a small number of training examples
Future work
- Include more information from document analysis
- Improve the similarity function using metric learning
- Learn the signature generation process