Term Extraction in Plagiarism Detection
IST 441 Query Formulation for Similarity Search
Student : Nitish Upreti
Customer : Kyle Williams
OUTLINE
• Introduction
• Motivation
• Challenges with Similarity Search
• Background & Reference Point
• Approaches to Similarity Search
• Our Approaches to Problem
• JateToolkit Introduction
• Solution Architecture
• Evaluation
• Conclusion
What is Similarity Search?
“Given a sample document and a standard Web search engine, the goal is to find similar documents to the given document.”
What is a similar document?
• Cosine Similarity
• Citation Similarity
• Code Similarity
• Multimedia Content Similarity
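Of the notions above, cosine similarity is the easiest to make concrete. A minimal sketch, assuming a plain bag-of-words representation with whitespace tokenization (the slides do not specify a weighting scheme, so raw term counts are used here purely for illustration):

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty document has no direction; treat it as dissimilar to everything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two documents with identical word counts score 1, and documents with no words in common score 0.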
Motivation
Plagiarism Detection
Process of locating instances of plagiarism in a suspicious document from the web.
Example : Turnitin™
Content Recommendation
Recommending articles from credible news sources based on social media entities such as tweets.
Academic Scenario : Research Paper Recommendation
Finding relevant documents for research paper recommendation.
Challenges Involved
• Constructing queries from the sample document in order to find similar documents is not obvious.
• For scalability, there are constraints on the maximum number of queries issued and results downloaded.
• Capture different facets of Similarity :
How can we be general enough to capture the theme but also specific enough to capture unique document attributes? (Domain Dependent)
BACKGROUND
The Big Picture
Credits : Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring, Notebook for PAN at CLEF 2013
Our Reference Point
• Source Retrieval is the KEY component.
(Dictates the possibility of future steps)
• Query Formulation is at the heart of this problem.
• Challenges with :
– How can we design better algorithms to formulate accurate queries?
– What has been done and what can be explored?
Our Reference Point (Contd..)
• CLEF: Conference and Labs of the Evaluation Forum.
• PAN Labs centers around the topics of plagiarism, authorship, and social software misuse.
– Author Identification
– Author Profiling
– Plagiarism Detection
• Evaluation possible in a Plagiarism domain.
Approaching Similarity Search
Major classes of Similarity Search :
• Choosing sentences from text corpus.
• Choosing a set of generic keywords.
• Term Extraction Algorithms.
• Topic Mining for document using Machine Learning techniques.
Mix and match ideas and employ well-known tweaks depending on the scenario. (Most of it is experimental)
Query Formulation Approach
Term Extraction
(Automatic extraction of relevant terms from a given corpus)
Approach Contd…
• Central Theme : Term Extraction Algorithms
• Approach Similarity Search in context of Term Extraction algorithms.
• Design a framework which incorporates these algorithms.
• Evaluate the algorithms.
• Document all the approaches.
Enter JateToolkit
Java Automatic Term Extraction toolkit
A library of state-of-the-art term extraction algorithms and a framework for developing term extraction algorithms.
https://code.google.com/p/jatetoolkit/
Term Extraction Approaches…
• Term Extraction Algorithms :
– TF-IDF
– RIDF
– Weirdness
– C-value
– GlossEx
– TermEx (Open Ended Project : Work in Progress)
– Justeson & Katz Algorithm
– NC Value Algorithm
– Rake Algorithm
– Chi-squared Algorithm
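To make the first entry in the list concrete, here is a minimal sketch of textbook TF-IDF scoring over tokenized documents. This is not JateToolkit's implementation, whose variant may normalize differently; the function name `tf_idf` and the input shape are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Score every term in every document with tf * log(N / df)."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A term that occurs in every document gets a score of zero, which is why TF-IDF favours terms that are frequent in one document but rare across the corpus.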
Solution Architecture
Phase 1 : Pre-Processing
Pre-Processing Document
StopList Pre-Processing
Extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words form the StopList.
• Use Jate’s built-in “StopList” for filtering.
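The filtering step can be sketched as follows. The tiny `STOP_LIST` below is a hypothetical stand-in, not the contents of Jate's built-in list, and the function name is an assumption.

```python
# Hypothetical stop list for illustration; Jate ships its own, larger list.
STOP_LIST = frozenset({"a", "an", "the", "of", "in", "to", "and", "is", "are", "for"})


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [token for token in tokens if token.lower() not in stop_list]
```

For example, `remove_stop_words("The detection of plagiarism in a document".split())` keeps only `detection`, `plagiarism` and `document`.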
Pre-Processing Document Contd…
Lemmatization
Group together words that are present in the document as different inflected forms into a single word so they can be analyzed as a single item.
Example : “run, runs, ran and running are forms of the same lexeme, with run as the lemma.”
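A minimal sketch of the idea, assuming a hand-written lookup table that covers only the example above. A real lemmatizer relies on a full morphological dictionary or a trained model; `LEMMA_TABLE` and `lemmatize` are illustrative names only.

```python
# Hypothetical lookup table covering just the "run" example.
LEMMA_TABLE = {"runs": "run", "ran": "run", "running": "run"}


def lemmatize(tokens, table=LEMMA_TABLE):
    """Map each token to its lemma; unknown tokens are only lowercased."""
    return [table.get(token.lower(), token.lower()) for token in tokens]
```

After this step the four inflected forms are counted as one item, so their frequencies add up instead of being split across variants.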
Phase 2 : Candidate Term Extraction
Candidate Term Extraction
• Approaches to Candidate Term Extraction :
1. Simply extracting single words as candidate terms, for tasks where single words are the terms. (Naïve Approach)
2. A generic N-gram extractor that extracts n-grams.
Final Approach : Apache OpenNLP NPE (Noun Phrase Extractor) that extracts noun phrases as candidate terms.
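The second approach is simple enough to sketch directly. This is a generic word n-gram extractor written for illustration, not the extractor shipped with Jate; with `n=1` it reduces to the naïve single-word approach.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a candidate term."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A document shorter than n tokens yields no candidates.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Unlike a noun phrase extractor, this produces every window of words regardless of grammatical role, so it generates many more (and noisier) candidates.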
Why are other two Approaches worth mentioning?
Performance of Term Extraction Algorithms is text corpus dependent.
(Our dataset was more receptive to NPE)
Phase 3 : Index Building
Building Document Index
• Using Jate toolkit to build a corpus index (Pre-Requisite for Term Extraction).
• Memory Based / Disk Resident file / Exporting to HSQL (HyperSQL).
Phase 4 : Building Statistical Features
Building Features for Jate Toolkit
• Word Count
• Feature Corpus Term Frequency (A feature store that contains information of term distributions over a corpus)
• Feature Term Nest Frequency (A feature store that contains information of nested terms)
Example: “Hedgehog” is a nested term in “European Hedgehog”.
• Executing a single or multithreaded client.
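The nested-term feature can be illustrated with a small sketch. It assumes candidate terms are lowercase, space-separated strings with known corpus frequencies; how Jate's feature store actually records nesting may differ, and both function names are hypothetical.

```python
def is_nested(short, long):
    """True if the words of `short` occur contiguously inside a longer term."""
    short_words, long_words = short.split(), long.split()
    if len(short_words) >= len(long_words):
        return False
    width = len(short_words)
    return any(
        long_words[i:i + width] == short_words
        for i in range(len(long_words) - width + 1)
    )


def nest_frequency(term_freq):
    """For each term, sum the frequencies of the longer terms that contain it."""
    return {
        term: sum(freq for other, freq in term_freq.items() if is_nested(term, other))
        for term in term_freq
    }
```

With frequencies `{"hedgehog": 5, "european hedgehog": 3, "hedgehog habitat": 2}`, the term `hedgehog` gets a nest frequency of 5, while the two longer terms get 0. Algorithms such as C-value use this kind of count to discount terms that mostly appear inside longer terms.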
Phase 5 : Register and Execute Algorithms
Jate Output File : term { variations } score
The output file is arranged in descending order of score.
Phase 6 : Post Processing
Writing an Output file suitable for submission.
Format : DocumentId { query terms }
(Maximum 10 non-repeating query terms)
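The post-processing step can be sketched as below. Several details are assumptions: the input is taken to be already-parsed `(term, score)` pairs, "non-repeating" is read as "no duplicate terms, ignoring case", and the comma separator inside the braces is a guess, since the slides give only the overall shape of the line.

```python
def build_query_line(doc_id, scored_terms, max_terms=10):
    """Format the top-scoring distinct terms as 'DocumentId { query terms }'."""
    # Highest score first, mirroring the ordering of the Jate output file.
    ranked = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)

    seen = set()
    chosen = []
    for term, _score in ranked:
        key = term.lower()
        if key in seen:
            continue  # skip repeats so every query term is distinct
        seen.add(key)
        chosen.append(term)
        if len(chosen) == max_terms:
            break

    # Assumption: comma-separated terms inside the braces.
    return f"{doc_id} {{ {', '.join(chosen)} }}"
```

Capping the list at ten terms reflects the submission limit stated above.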
Evaluation
• Last year's PAN CLEF Baseline :
Precision = 0.244388609715 (200 queries)
• Performance for Term Extraction Algorithms (105 queries) :
1. IBM’s GlossEx : 0.171428571429
2. C Value : 0.0598255721489
3. TermEx : 0.0635
4. Weirdness : 0.03190851
5. RIDF : 0.176470588235
6. TF-IDF : 0.13058482157
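For reference, precision in its usual information-retrieval sense is the fraction of retrieved documents that are actually relevant. The sketch below shows that standard definition only; the official PAN evaluation may count matches differently (for example, treating near-duplicates of a source as hits), so it is not a reimplementation of the scores above.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved document ids that are in the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # nothing retrieved: define precision as zero
    return len(retrieved & relevant) / len(retrieved)
```

For instance, retrieving four documents of which one is a true source gives a precision of 0.25.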
RESULTS
• The code is live on GitHub!
https://github.com/myth17/QF
• Code, Query Logs and entire results submitted to Kyle.
• Working on incorporating the other alpha term extraction algorithms.
• Future Work : How can the results be improved and integrated with topic modeling?
Questions ?
(Thank You!)