University Politehnica of Bucharest
Automatic Plagiarism Detection System for Specialized Corpora
Filip Cristian Buruiană, Adrian Scoică, Traian Rebedea (traian.rebedea@cs.pub.ro), Razvan Rughiniș
Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions
12.04.23 Bachelor Theses Defense Session – July 2012
Introduction
• Plagiarism: the unauthorized appropriation of the language or thoughts of another author, and the representation of that author's work as one's own without giving proper credit to the original author
• Large numbers of documents => automatic detection is needed
• Information Retrieval techniques:
  – Stemming (e.g. beauty, beautiful, beautifulness => beauti)
  – Vector Space Model
  – tf-idf weighting, cosine similarity
• Measuring results: precision, recall, granularity => F-measure
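The IR building blocks above can be sketched in a few lines of Python. This is a minimal illustration with a toy corpus and whitespace tokenization (not AuthentiCop's actual code, which is implemented in C++):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors for a small corpus of tokenized docs."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "plagiarism detection for specialized corpora".split(),
    "automatic plagiarism detection system".split(),
    "vector space model with cosine similarity".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: the two docs share terms
print(cosine(vecs[0], vecs[2]))   # 0.0: no terms in common
```

Note that a term occurring in every document gets idf = log(1) = 0 and thus contributes nothing to the similarity, which is the intended behavior of the weighting.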
12.04.23 CSCS 2013 – Bucharest, Romania 3
Existing solutions
• Many commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general, topic-independent solutions
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: a good corpus is needed (annotated by humans, how to find plagiarized documents, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on general texts
• Corpora used:
  – PAN 2011 (the "evaluation lab on uncovering plagiarism, authorship, and social software misuse" at CLEF)
  – Bachelor theses @ A&C
System Architecture
• Web interface for accessing AuthentiCop
  – Simple to add documents (text, PDF) and to highlight suspicious passages
System architecture
• Logical separation:
  – Front-end (PHP, JavaScript + AJAX, jQuery)
  – Back-end (C++)
  – Cross-language communication
• Scalable solution, easy to update:
  – The web server (front-end) and the plagiarism detection modules (back-end) may run on different machines
  – Plagiarism detection can be distributed across different machines (distributed workers)
• Several external open-source libraries are used (e.g. Apache Tika, CLucene, etc.)
System architecture
System architecture
• Example: sequence of steps for processing PDF files
  – Apache Tika is used to transform PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution
Detection of plagiarism
• Two different problems:
  – Intrinsic plagiarism (analyze only the suspicious document)
  – External plagiarism (also has a reference collection to check against)
    • How large is the collection? Online sources?
    • Source identification
    • Text alignment
Detection of plagiarism
Steps for external plagiarism detection
1. Candidate selection
   – Find pairs of suspicious texts
   – Combines source identification with text alignment
2. Detailed analysis
3. Post-processing
Algorithms for candidate selection
• Selection of the plausible pairs of plagiarism
• Uses stop-word elimination, tf-idf weighting & cosine similarity
• Initial hypothesis: treat it as a "Similarity Search Problem" – All-Pairs, ppjoin (Prefix Filtering with Positional Information Join)
Algorithms for candidate selection
• FastDocode (presented at PAN 2010) + caching + sub-linear merging
• New approach:
  – Text segments => fingerprints & indexing with Apache CLucene
  – Compute the number of inversions
| N-gram length | Segment size | Retention rate | TP | FP | FN | Time (h) | Plagdet |
|---|---|---|---|---|---|---|---|
| 3 | 150 | 10% | 5413 | 44522 | 11469 | ~1 | 0.162 |
| 4 | 150 | 10% | 4913 | 10297 | 11969 | ~2 | 0.306 |
| 4 | 150 | 30% | 7633 | 35169 | 9249 | ~4.5 | 0.256 |
| 5 | 150 | 20% | 5194 | 6256 | 11688 | ~3 | 0.367 |
Comparison of methods (on 1000 documents):

| Method | TP | FP | FN | Prec. | Recall | Plagdet |
|---|---|---|---|---|---|---|
| Fingerprinting & indexing | 685 | 494 | 761 | 0.581 | 0.474 | 0.522 |
| FastDocode#3 | 634 | 4097 | 812 | 0.134 | 0.438 | 0.205 |
| FastDocode#4 | 424 | 815 | 1022 | 0.342 | 0.293 | 0.316 |
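The fingerprinting & indexing idea can be sketched as follows. This is a simplified stand-in: MD5 hashes of word 4-grams and a plain Python dict instead of the actual CLucene index, with the segment texts chosen for illustration. Full retention is used here so the tiny example behaves deterministically; the runs in the tables keep only 10-30% of the fingerprints per segment:

```python
import hashlib
from collections import defaultdict

def fingerprints(text, n=4, retention=1.0):
    """Hash every word n-gram of a segment and keep only the smallest
    `retention` fraction of the hashes (a min-hash-style retention rate)."""
    words = text.lower().split()
    hashes = sorted(
        int(hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest(), 16)
        for i in range(len(words) - n + 1)
    )
    k = max(1, int(len(hashes) * retention)) if hashes else 0
    return set(hashes[:k])

# Inverted index: fingerprint hash -> ids of source segments containing it
sources = {
    "src1": "the canny edge detector is susceptible to noise present in raw unprocessed image data",
    "src2": "dynamic programming solves the longest common subsequence problem efficiently",
}
index = defaultdict(set)
for sid, text in sources.items():
    for h in fingerprints(text):
        index[h].add(sid)

# A suspicious segment votes for every source it shares fingerprints with;
# sources with many votes become candidate pairs for detailed analysis
suspicious = "the canny edge detector is susceptible to noise present on raw image data"
candidates = defaultdict(int)
for h in fingerprints(suspicious):
    for sid in index[h]:
        candidates[sid] += 1
print(dict(candidates))   # -> {'src1': 6}: six shared 4-grams
```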
Algorithms for detailed analysis
• DotPlot: a "Sequence Alignment Problem"
• Modified FastDocode:
  – Extending the analysis to the right and to the left, starting from common words/passages
  – Using passages instead of words as seeds for the comparison
  – tf-idf weighting & cosine similarity
Image source: Wikipedia
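The seed-and-extend idea behind this step can be sketched as below. This simplification seeds on single shared words and extends only over exact word matches; the system described above uses passages as seeds and tf-idf/cosine similarity rather than exact equality:

```python
def extend_match(susp, src, i, j):
    """Extend a matching seed at susp[i] == src[j] left and right while
    the surrounding words keep matching; return the aligned spans."""
    lo_i, lo_j = i, j
    while lo_i > 0 and lo_j > 0 and susp[lo_i - 1] == src[lo_j - 1]:
        lo_i, lo_j = lo_i - 1, lo_j - 1
    hi_i, hi_j = i, j
    while (hi_i + 1 < len(susp) and hi_j + 1 < len(src)
           and susp[hi_i + 1] == src[hi_j + 1]):
        hi_i, hi_j = hi_i + 1, hi_j + 1
    return (lo_i, hi_i), (lo_j, hi_j)

def seed_and_extend(susp_text, src_text, min_len=5):
    """Find maximal common word runs, seeded at every shared word."""
    susp, src = susp_text.lower().split(), src_text.lower().split()
    positions = {}
    for j, w in enumerate(src):
        positions.setdefault(w, []).append(j)
    spans = set()
    for i, w in enumerate(susp):
        for j in positions.get(w, []):
            (a, b), (c, d) = extend_match(susp, src, i, j)
            if b - a + 1 >= min_len:
                spans.add((a, b, c, d))
    return sorted(spans)

susp = "the result is a slightly blurred version of the original image"
src = "the result is a slightly blurred version of the original text"
print(seed_and_extend(susp, src))   # -> [(0, 9, 0, 9)]: the shared 10-word run
```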
Algorithms for post-processing
• Semantic analysis using LSA
  – Built a semantic space with papers from Computer Science (and pages from Wikipedia)
  – Gensim framework in Python
• Smith-Waterman algorithm
  – Dynamic programming
  – Similar to the longest common subsequence
  – Insert and delete operations may have any cost (they may be greater than 1)
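A minimal Smith-Waterman scoring sketch over word tokens follows; unlike LCS, the gap (insert/delete) and mismatch costs are free parameters. The token sequences and score values here are illustrative, not the ones used by the system:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two token sequences.
    Gap costs are configurable and may exceed 1 in magnitude."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores are floored at 0, so a bad region
            # never drags down a later, independent match
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

x = "the raw image is convolved with a gaussian filter".split()
y = "the image is convolved with a gaussian kernel".split()
print(smith_waterman(x, y))   # -> 12: six matching words at +2 each
```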
Results
• Corpus: PAN 2011 (~22k documents)
• Run time on a laptop: ~20 hours
• Results – official results from PAN 2011:

| Plagdet | Recall | Precision | Granularity |
|---|---|---|---|
| 0.221929185084 | 0.202996955425 | 0.366482242839 | 1.26150173611 |
Results
• Specific corpus for CS:
  – 940 BSc theses + 8700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
  – 307 BSc theses in English
• Example of a detected pair:
  – Plagiarized text: "The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree."
  – Original text from Wikipedia: "Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree."
• Some elements are incorrectly identified as plagiarism: quotes, bibliographic references
Conclusions
• Improve the corpus
• The system uses several parameters that were determined empirically => use machine learning to find the best values
• Increase the processing speed
• Improve the method: "bag of words" + information about the positions of the words
• Better post-processing is needed for real documents (such as scientific papers or theses)
Thank you!
• Questions?
• Discussion