Term Extraction in Plagiarism Detection
Page 1: Final presentation

IST 441 Query Formulation for Similarity Search

Student : Nitish Upreti

Customer : Kyle Williams

[email protected]

[email protected]

Page 2: Final presentation

OUTLINE

• Introduction
• Motivation
• Challenges with Similarity Search
• Background & Reference Point
• Approaches to Similarity Search
• Our Approach to the Problem
• JateToolkit Introduction
• Solution Architecture
• Evaluation
• Conclusion

Page 3: Final presentation

What is Similarity Search?

“Given a sample document and a standard Web search engine, the goal is to find similar documents to the given document.”

What is a similar document?

• Cosine Similarity (see the formula below)

• Citation Similarity

• Code Similarity

• Multimedia Content Similarity
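Of these, cosine similarity has a standard closed form. For two documents represented as term-weight vectors (this is the textbook definition, nothing specific to this project):

\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} = \frac{\sum_i d_{1,i}\, d_{2,i}}{\sqrt{\sum_i d_{1,i}^2}\, \sqrt{\sum_i d_{2,i}^2}} \]

A score of 1 means the vectors point in the same direction; 0 means the documents share no weighted terms.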

Page 4: Final presentation

Motivation

Plagiarism Detection
The process of locating instances of plagiarism in a suspicious document from the web.

Example : Turnitin™

Content Recommendation

Recommending articles from credible news sources based on social media entities such as tweets.

Academic Scenario : Research Paper Recommendation

Finding relevant documents for research paper recommendation.

Page 5: Final presentation

Challenges Involved

• Constructing queries from the sample document in order to find similar documents is not obvious.

• There are constraints on the maximum number of queries and results to be downloaded, for scalability reasons.

• Capturing different facets of Similarity :

How can we be general enough to capture the theme but also specific enough to capture unique document attributes? (Domain dependent)

Page 6: Final presentation

BACKGROUND

Page 7: Final presentation

The Big Picture

Credits : Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring. Notebook for PAN at CLEF 2013.

Page 8: Final presentation

Our Reference Point

• Source Retrieval is the KEY component.

(Dictates the possibility of future steps)

• Query Formulation is at the heart of this problem.

• Challenges :

– How can we design better algorithms to formulate accurate queries?

– What has been done and what can be explored?

Page 9: Final presentation

Our Reference Point (Contd.)

• CLEF: Conference and Labs of the Evaluation Forum.

• PAN Labs centers around the topics of plagiarism, authorship, and social software misuse.

– Author Identification

– Author Profiling

– Plagiarism Detection

• Evaluation is possible in the Plagiarism domain.

Page 10: Final presentation

Approaching Similarity Search

Major classes of Similarity Search :

• Choosing sentences from the text corpus.

• Choosing a set of generic keywords.

• Term Extraction Algorithms.

• Topic Mining for a document using Machine Learning techniques.

Mix and match these ideas and employ well-known tweaks depending on the scenario. (Most of it is experimental.)

Page 11: Final presentation

Query Formulation Approach : Term Extraction

(Automatic extraction of relevant terms from a given corpus)

Page 12: Final presentation

Approach Contd…

• Central Theme : Term Extraction Algorithms

• Approach Similarity Search in the context of Term Extraction algorithms.

• Design a framework which incorporates these algorithms.

• Evaluate the algorithms.

• Document all the approaches.

Page 13: Final presentation

Enter JateToolkit

Java Automatic Term Extraction toolkit: a library of state-of-the-art term extraction algorithms and a framework for developing term extraction algorithms.

https://code.google.com/p/jatetoolkit/

Page 14: Final presentation

Term Extraction Approaches…

• Term Extraction Algorithms :

– TF-IDF (a minimal sketch follows this list)
– RIDF
– Weirdness
– C-value
– GlossEx
– TermEx (Open Ended Project : Work in Progress)
– Justeson & Katz Algorithm
– NC Value Algorithm
– RAKE Algorithm
– Chi-squared Algorithm
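As a concrete reference for the simplest scorer on this list, here is a minimal TF-IDF computation over an in-memory corpus. This is our own sketch for illustration, not JATE's implementation; the class and method names are invented:

import java.util.*;

public class TfIdfSketch {
    // TF-IDF of a term in one document, relative to a corpus of tokenised documents.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((1.0 + corpus.size()) / (1 + docsWithTerm)); // smoothed IDF
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("plagiarism", "detection", "query"),
                List.of("query", "formulation", "search"),
                List.of("similarity", "search", "query"));
        // "plagiarism" is rare in the corpus, so it scores higher than "query".
        System.out.println(tfIdf("plagiarism", corpus.get(0), corpus));
        System.out.println(tfIdf("query", corpus.get(0), corpus));
    }
}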

Page 15: Final presentation

Solution Architecture

Page 16: Final presentation

Phase 1 : Pre-Processing

Page 17: Final presentation

Pre-Processing Document

StopList Pre-Processing

Extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words form the StopList.

• Use Jate’s built-in “StopList” for filtering, as sketched below.
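The idea is easy to show in isolation. The hand-rolled set below is only a stand-in for JATE's bundled stop list, used here to illustrate the filtering step:

import java.util.*;
import java.util.stream.*;

public class StopListSketch {
    // A tiny stand-in for a real stop list such as the one bundled with JATE.
    static final Set<String> STOP = Set.of("the", "a", "of", "and", "is", "to", "in");

    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                List.of("the", "goal", "is", "to", "find", "similar", "documents")));
        // -> [goal, find, similar, documents]
    }
}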

Page 18: Final presentation

Pre-Processing Document Contd…

Lemmatization

Group together words that are present in the document as different inflected forms to a single word so they can be analyzed as a single item.

Example : “run, runs, ran and running are forms of the same lexeme, with run as the lemma.”
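A real lemmatizer is dictionary- and POS-driven; the toy lookup table below only illustrates the grouping idea using the slide's own example:

import java.util.*;

public class LemmaSketch {
    // Toy lemma table; a production system would use a dictionary-backed lemmatizer.
    static final Map<String, String> LEMMAS =
            Map.of("runs", "run", "ran", "run", "running", "run");

    static String lemma(String word) {
        String w = word.toLowerCase();
        return LEMMAS.getOrDefault(w, w);
    }

    public static void main(String[] args) {
        for (String w : List.of("run", "runs", "ran", "running")) {
            System.out.println(w + " -> " + lemma(w)); // every form maps to "run"
        }
    }
}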

Page 19: Final presentation

Phase 2 : Candidate Term Extraction

Page 20: Final presentation

Candidate Term Extraction

• Approaches to Candidate Term Extraction :

1. Simply extracting single words as candidate terms, if your task treats single words as terms. (Naïve approach)

2. A generic N-gram extractor that extracts ‘n-grams’. (Sketched below.)

Final Approach : Apache OpenNLP’s NPE (Noun Phrase Extractor), which extracts noun phrases as candidate terms.
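Of the three, the generic extractor from option 2 is simple enough to sketch directly (our own minimal version, not JATE's or OpenNLP's):

import java.util.*;

public class NGramSketch {
    // Return all n-grams between minN and maxN words, joined with spaces.
    static List<String> ngrams(List<String> tokens, int minN, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("european", "hedgehog", "population");
        System.out.println(ngrams(tokens, 1, 2));
        // -> [european, hedgehog, population, european hedgehog, hedgehog population]
    }
}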

Page 21: Final presentation

Why are the other two approaches worth mentioning?

Performance of Term Extraction Algorithms is text corpus dependent.

(Our dataset was more receptive to NPE)

Page 22: Final presentation

Phase 3 : Index Building

Page 23: Final presentation

Building Document Index

• Using the Jate toolkit to build a corpus index (a pre-requisite for Term Extraction).

• Storage options: memory based, disk resident file, or exporting to HSQL (HyperSQL). (A minimal in-memory sketch follows.)
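Conceptually, the corpus index maps each term to the documents (and counts) in which it occurs. A minimal in-memory version of that structure, standing in for JATE's memory-based index (names are ours):

import java.util.*;

public class IndexSketch {
    // term -> (docId -> frequency): the core structure behind a corpus index.
    final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, List<String> tokens) {
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        IndexSketch idx = new IndexSketch();
        idx.add(1, List.of("plagiarism", "detection", "plagiarism"));
        idx.add(2, List.of("similarity", "detection"));
        System.out.println(idx.postings.get("detection"));  // once in doc 1 and doc 2
        System.out.println(idx.postings.get("plagiarism")); // twice in doc 1
    }
}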

Page 24: Final presentation

Phase 4 : Building Statistical Features

Page 25: Final presentation

Building Features for Jate Toolkit

• Word Count

• Feature Corpus Term Frequency (a feature store that contains information about term distributions over a corpus)

• Feature Term Nest Frequency (a feature store that contains information about nested terms). Example: “Hedgehog” is a nested term in “European Hedgehog”. (See the sketch after this list.)

• Executing a single or multithreaded client.
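The nested-term feature can be illustrated with the slide's own example. The sketch below counts, for a candidate term, how many longer candidates contain all of its words (a loose notion of nesting; the real feature store is JATE's):

import java.util.*;

public class NestFreqSketch {
    // Count the candidate terms that contain every word of the given term.
    static long nestFrequency(String term, List<String> candidates) {
        List<String> words = Arrays.asList(term.split(" "));
        return candidates.stream()
                .filter(c -> !c.equals(term))
                .filter(c -> Arrays.asList(c.split(" ")).containsAll(words))
                .count();
    }

    public static void main(String[] args) {
        List<String> candidates =
                List.of("hedgehog", "european hedgehog", "hedgehog burrow");
        System.out.println(nestFrequency("hedgehog", candidates)); // -> 2
    }
}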

Page 26: Final presentation

Phase 5 : Register and Execute Algorithms

Jate Output File : term { variations } score

The output file is arranged in descending order of score. (A reader sketch follows.)
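A minimal reader for a file in that layout might look as follows. The file name and the exact delimiters are assumptions on our part; adjust the pattern to the file JATE actually emits:

import java.io.IOException;
import java.nio.file.*;
import java.util.regex.*;

public class JateOutputReader {
    // Assumed line layout: term {variant1,variant2,...} score
    static final Pattern LINE = Pattern.compile("^(.+?)\\s*\\{(.*)\\}\\s*([0-9.Ee+-]+)$");

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Path.of("jate_output.txt"))) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) continue; // skip malformed lines
            String term = m.group(1);
            String[] variations = m.group(2).split(",");
            double score = Double.parseDouble(m.group(3));
            System.out.printf("%s (%d variations): %.4f%n", term, variations.length, score);
        }
    }
}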

Page 27: Final presentation

Phase 6 : Post Processing

Writing an output file suitable for submission.

Format : DocumentId { query terms }

(Maximum 10 non-repeating query terms)
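Since the terms arrive already sorted by descending score, post-processing reduces to de-duplicating, truncating to 10, and formatting. A sketch (the exact separators expected by the submission format are assumed):

import java.util.*;
import java.util.stream.*;

public class SubmissionWriter {
    // Keep at most 10 distinct terms, preserving the descending-score order.
    static String formatLine(String documentId, List<String> rankedTerms) {
        List<String> top = rankedTerms.stream()
                .distinct()
                .limit(10)
                .collect(Collectors.toList());
        return documentId + " { " + String.join(" ", top) + " }";
    }

    public static void main(String[] args) {
        System.out.println(formatLine("suspicious-document042",
                List.of("plagiarism", "hedgehog", "plagiarism", "query", "formulation")));
        // -> suspicious-document042 { plagiarism hedgehog query formulation }
    }
}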

Page 28: Final presentation

Evaluation

• Last year’s PAN CLEF baseline :

Precision = 0.244388609715 (200 queries; precision as defined below)

• Performance of the Term Extraction Algorithms (105 queries) :

1. IBM’s GlossEx : 0.171428571429
2. C-Value : 0.0598255721489
3. TermEx : 0.0635
4. Weirdness : 0.03190851
5. RIDF : 0.176470588235
6. TF-IDF : 0.13058482157
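For reference, precision here is the usual retrieval measure, computed over the documents the queries retrieved (our reading of the PAN source-retrieval setup):

\[ \text{Precision} = \frac{|\,\text{retrieved documents} \cap \text{actual sources}\,|}{|\,\text{retrieved documents}\,|} \]

So RIDF's 0.176, for instance, means roughly 18% of the documents its queries retrieved were true sources.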

Page 29: Final presentation

RESULTS

• The code is live on GitHub! https://github.com/myth17/QF

• Code, query logs, and the entire results submitted to Kyle.

• Working on incorporating the other alpha term extraction algorithms.

• Future Work : How can the results be improvedand integrated with topic modeling?

Page 30: Final presentation

Questions ? (Thank You!)