Term Extraction in Plagiarism Detection
Page 1: Final presentation

IST 441 Query Formulation for Similarity Search

Student : Nitish Upreti

Customer : Kyle Williams

[email protected]

[email protected]

Page 2: Final presentation

OUTLINE

• Introduction
• Motivation
• Challenges with Similarity Search
• Background & Reference Point
• Approaches to Similarity Search
• Our Approach to the Problem
• JateToolkit Introduction
• Solution Architecture
• Evaluation
• Conclusion

Page 3: Final presentation

What is Similarity Search?

“Given a sample document and a standard Web search engine, the goal is to find similar documents to the given document.”

What is a similar document?

• Cosine Similarity (see the formula below)

• Citation Similarity

• Code Similarity

• Multimedia Content Similarity
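Of these, cosine similarity has a standard closed form. For two documents represented as term-weight vectors (this is the textbook definition, nothing specific to this project):

\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} = \frac{\sum_i d_{1,i}\, d_{2,i}}{\sqrt{\sum_i d_{1,i}^2}\, \sqrt{\sum_i d_{2,i}^2}} \]

A score of 1 means the vectors point in the same direction; 0 means the documents share no weighted terms.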

Page 4: Final presentation

Motivation

Plagiarism Detection
The process of locating instances of plagiarism in a suspicious document from the web.

Example : Turnitin™

Content Recommendation

Recommending articles from credible news sources based on social media entities such as tweets.

Academic Scenario : Research Paper Recommendation

Finding relevant documents for research paper recommendation.

Page 5: Final presentation

Challenges Involved

• Constructing queries from the sample document in order to find similar documents is not obvious.

• There are constraints on the maximum number of queries and results to be downloaded, for scalability reasons.

• Capturing different facets of Similarity :

How can we be general enough to capture the theme but also specific enough to capture unique document attributes? (Domain dependent)

Page 6: Final presentation

BACKGROUND

Page 7: Final presentation

The Big Picture

Credits : Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring. Notebook for PAN at CLEF 2013.

Page 8: Final presentation

Our Reference Point

• Source Retrieval is the KEY component.

(Dictates the possibility of future steps)

• Query Formulation is at the heart of this problem.

• Challenges :

– How can we design better algorithms to formulate accurate queries?

– What has been done and what can be explored?

Page 9: Final presentation

Our Reference Point (Contd.)

• CLEF: Conference and Labs of the Evaluation Forum.

• PAN Labs centers around the topics of plagiarism, authorship, and social software misuse.

– Author Identification

– Author Profiling

– Plagiarism Detection

• Evaluation is possible in the Plagiarism domain.

Page 10: Final presentation

Approaching Similarity Search

Major classes of Similarity Search :

• Choosing sentences from the text corpus.

• Choosing a set of generic keywords.

• Term Extraction Algorithms.

• Topic Mining for a document using Machine Learning techniques.

Mix and match these ideas and employ well-known tweaks depending on the scenario. (Most of it is experimental.)

Page 11: Final presentation

Query Formulation Approach : Term Extraction

(Automatic extraction of relevant terms from a given corpus)

Page 12: Final presentation

Approach Contd…

• Central Theme : Term Extraction Algorithms

• Approach Similarity Search in the context of Term Extraction algorithms.

• Design a framework which incorporates these algorithms.

• Evaluate the algorithms.

• Document all the approaches.

Page 13: Final presentation

Enter JateToolkit

Java Automatic Term Extraction toolkit: a library of state-of-the-art term extraction algorithms and a framework for developing term extraction algorithms.

https://code.google.com/p/jatetoolkit/

Page 14: Final presentation

Term Extraction Approaches…

• Term Extraction Algorithms :

– TF-IDF (a minimal sketch follows this list)
– RIDF
– Weirdness
– C-value
– GlossEx
– TermEx (Open Ended Project : Work in Progress)
– Justeson & Katz Algorithm
– NC Value Algorithm
– RAKE Algorithm
– Chi-squared Algorithm
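As a concrete reference for the simplest scorer on this list, here is a minimal TF-IDF computation over an in-memory corpus. This is our own sketch for illustration, not JATE's implementation; the class and method names are invented:

import java.util.*;

public class TfIdfSketch {
    // TF-IDF of a term in one document, relative to a corpus of tokenised documents.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((1.0 + corpus.size()) / (1 + docsWithTerm)); // smoothed IDF
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("plagiarism", "detection", "query"),
                List.of("query", "formulation", "search"),
                List.of("similarity", "search", "query"));
        // "plagiarism" is rare in the corpus, so it scores higher than "query".
        System.out.println(tfIdf("plagiarism", corpus.get(0), corpus));
        System.out.println(tfIdf("query", corpus.get(0), corpus));
    }
}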

Page 15: Final presentation

Solution Architecture

Page 16: Final presentation

Phase 1 : Pre-Processing

Page 17: Final presentation

Pre-Processing Document

StopList Pre-Processing

Extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words form the StopList.

• Use Jate’s built-in “StopList” for filtering, as sketched below.
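The idea is easy to show in isolation. The hand-rolled set below is only a stand-in for JATE's bundled stop list, used here to illustrate the filtering step:

import java.util.*;
import java.util.stream.*;

public class StopListSketch {
    // A tiny stand-in for a real stop list such as the one bundled with JATE.
    static final Set<String> STOP = Set.of("the", "a", "of", "and", "is", "to", "in");

    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                List.of("the", "goal", "is", "to", "find", "similar", "documents")));
        // -> [goal, find, similar, documents]
    }
}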

Page 18: Final presentation

Pre-Processing Document Contd…

Lemmatization

Group together words that are present in the document as different inflected forms to a single word so they can be analyzed as a single item.

Example : “run, runs, ran and running are forms of the same lexeme, with run as the lemma.”
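A real lemmatizer is dictionary- and POS-driven; the toy lookup table below only illustrates the grouping idea using the slide's own example:

import java.util.*;

public class LemmaSketch {
    // Toy lemma table; a production system would use a dictionary-backed lemmatizer.
    static final Map<String, String> LEMMAS =
            Map.of("runs", "run", "ran", "run", "running", "run");

    static String lemma(String word) {
        String w = word.toLowerCase();
        return LEMMAS.getOrDefault(w, w);
    }

    public static void main(String[] args) {
        for (String w : List.of("run", "runs", "ran", "running")) {
            System.out.println(w + " -> " + lemma(w)); // every form maps to "run"
        }
    }
}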

Page 19: Final presentation

Phase 2 : Candidate Term Extraction

Page 20: Final presentation

Candidate Term Extraction

• Approaches to Candidate Term Extraction :

1. Simply extracting single words as candidate terms, if your task treats single words as terms. (Naïve approach)

2. A generic N-gram extractor that extracts ‘n-grams’. (Sketched below.)

Final Approach : Apache OpenNLP’s NPE (Noun Phrase Extractor), which extracts noun phrases as candidate terms.
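Of the three, the generic extractor from option 2 is simple enough to sketch directly (our own minimal version, not JATE's or OpenNLP's):

import java.util.*;

public class NGramSketch {
    // Return all n-grams between minN and maxN words, joined with spaces.
    static List<String> ngrams(List<String> tokens, int minN, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("european", "hedgehog", "population");
        System.out.println(ngrams(tokens, 1, 2));
        // -> [european, hedgehog, population, european hedgehog, hedgehog population]
    }
}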

Page 21: Final presentation

Why are the other two approaches worth mentioning?

Performance of Term Extraction Algorithms is text corpus dependent.

(Our dataset was more receptive to NPE)

Page 22: Final presentation

Phase 3 : Index Building

Page 23: Final presentation

Building Document Index

• Using the Jate toolkit to build a corpus index (a pre-requisite for Term Extraction).

• Storage options: memory based, disk resident file, or exporting to HSQL (HyperSQL). (A minimal in-memory sketch follows.)
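Conceptually, the corpus index maps each term to the documents (and counts) in which it occurs. A minimal in-memory version of that structure, standing in for JATE's memory-based index (names are ours):

import java.util.*;

public class IndexSketch {
    // term -> (docId -> frequency): the core structure behind a corpus index.
    final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, List<String> tokens) {
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        IndexSketch idx = new IndexSketch();
        idx.add(1, List.of("plagiarism", "detection", "plagiarism"));
        idx.add(2, List.of("similarity", "detection"));
        System.out.println(idx.postings.get("detection"));  // once in doc 1 and doc 2
        System.out.println(idx.postings.get("plagiarism")); // twice in doc 1
    }
}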

Page 24: Final presentation

Phase 4 : Building Statistical Features

Page 25: Final presentation

Building Features for Jate Toolkit

• Word Count

• Feature Corpus Term Frequency (a feature store that contains information about term distributions over a corpus)

• Feature Term Nest Frequency (a feature store that contains information about nested terms). Example: “Hedgehog” is a nested term in “European Hedgehog”. (See the sketch after this list.)

• Executing a single or multithreaded client.
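The nested-term feature can be illustrated with the slide's own example. The sketch below counts, for a candidate term, how many longer candidates contain all of its words (a loose notion of nesting; the real feature store is JATE's):

import java.util.*;

public class NestFreqSketch {
    // Count the candidate terms that contain every word of the given term.
    static long nestFrequency(String term, List<String> candidates) {
        List<String> words = Arrays.asList(term.split(" "));
        return candidates.stream()
                .filter(c -> !c.equals(term))
                .filter(c -> Arrays.asList(c.split(" ")).containsAll(words))
                .count();
    }

    public static void main(String[] args) {
        List<String> candidates =
                List.of("hedgehog", "european hedgehog", "hedgehog burrow");
        System.out.println(nestFrequency("hedgehog", candidates)); // -> 2
    }
}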

Page 26: Final presentation

Phase 5 : Register and Execute Algorithms

Jate Output File : term { variations } score

The output file is arranged in descending order of score. (A reader sketch follows.)
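A minimal reader for a file in that layout might look as follows. The file name and the exact delimiters are assumptions on our part; adjust the pattern to the file JATE actually emits:

import java.io.IOException;
import java.nio.file.*;
import java.util.regex.*;

public class JateOutputReader {
    // Assumed line layout: term {variant1,variant2,...} score
    static final Pattern LINE = Pattern.compile("^(.+?)\\s*\\{(.*)\\}\\s*([0-9.Ee+-]+)$");

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Path.of("jate_output.txt"))) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) continue; // skip malformed lines
            String term = m.group(1);
            String[] variations = m.group(2).split(",");
            double score = Double.parseDouble(m.group(3));
            System.out.printf("%s (%d variations): %.4f%n", term, variations.length, score);
        }
    }
}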

Page 27: Final presentation

Phase 6 : Post Processing

Writing an output file suitable for submission.

Format : DocumentId { query terms }

(Maximum 10 non-repeating query terms)
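Since the terms arrive already sorted by descending score, post-processing reduces to de-duplicating, truncating to 10, and formatting. A sketch (the exact separators expected by the submission format are assumed):

import java.util.*;
import java.util.stream.*;

public class SubmissionWriter {
    // Keep at most 10 distinct terms, preserving the descending-score order.
    static String formatLine(String documentId, List<String> rankedTerms) {
        List<String> top = rankedTerms.stream()
                .distinct()
                .limit(10)
                .collect(Collectors.toList());
        return documentId + " { " + String.join(" ", top) + " }";
    }

    public static void main(String[] args) {
        System.out.println(formatLine("suspicious-document042",
                List.of("plagiarism", "hedgehog", "plagiarism", "query", "formulation")));
        // -> suspicious-document042 { plagiarism hedgehog query formulation }
    }
}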

Page 28: Final presentation

Evaluation

• Last year’s PAN CLEF baseline :

Precision = 0.244388609715 (200 queries; precision as defined below)

• Performance of the Term Extraction Algorithms (105 queries) :

1. IBM’s GlossEx : 0.171428571429
2. C-Value : 0.0598255721489
3. TermEx : 0.0635
4. Weirdness : 0.03190851
5. RIDF : 0.176470588235
6. TF-IDF : 0.13058482157
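For reference, precision here is the usual retrieval measure, computed over the documents the queries retrieved (our reading of the PAN source-retrieval setup):

\[ \text{Precision} = \frac{|\,\text{retrieved documents} \cap \text{actual sources}\,|}{|\,\text{retrieved documents}\,|} \]

So RIDF's 0.176, for instance, means roughly 18% of the documents its queries retrieved were true sources.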

Page 29: Final presentation

RESULTS

• The code is live on GitHub! https://github.com/myth17/QF

• Code, query logs, and the entire results submitted to Kyle.

• Working on incorporating the other alpha term extraction algorithms.

• Future Work : How can the results be improvedand integrated with topic modeling?

Page 30: Final presentation

Questions ? (Thank You!)