University Politehnica of Bucharest
Automatic Plagiarism Detection System for Specialized Corpora
Filip Cristian Buruiană, Adrian Scoică, Traian Rebedea (traian.rebedea@cs.pub.ro), Razvan Rughiniș
Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions
12.04.23 Bachelor Theses Defense Session – July 2012
Introduction
• Plagiarism: the unauthorized appropriation of the language or thoughts of another author, and the representation of that author's work as one's own without giving proper credit to the original author
• Large numbers of documents => automatic detection is needed
• Information Retrieval techniques:
  – Stemming (e.g. beauty, beautiful, beautifulness => beauti)
  – Vector Space Model
  – tf-idf weighting, cosine similarity
• Measuring results: precision, recall, granularity => F-measure
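The IR building blocks above can be sketched in a few lines of Python. This is a minimal illustration with a toy corpus and whitespace tokenization (not AuthentiCop's actual code, which is implemented in C++):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors for a small corpus of tokenized docs."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "plagiarism detection for specialized corpora".split(),
    "automatic plagiarism detection system".split(),
    "vector space model with cosine similarity".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: the two docs share terms
print(cosine(vecs[0], vecs[2]))   # 0.0: no terms in common
```

Note that a term occurring in every document gets idf = log(1) = 0 and thus contributes nothing to the similarity, which is the intended behavior of the weighting.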
12.04.23 CSCS 2013 – Bucharest, Romania 3
Existing solutions
• Many commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general, topic-independent solutions
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: a good corpus is needed (annotated by humans, how to find plagiarized documents, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on general texts
• Corpora used:
  – PAN 2011 (the "evaluation lab on uncovering plagiarism, authorship, and social software misuse" at CLEF)
  – Bachelor theses @ A&C
System Architecture
• Web interface for accessing AuthentiCop
  – Simple to add documents (text, PDF) and to highlight suspicious passages
System architecture
• Logical separation:
  – Front-end (PHP, JavaScript + AJAX, jQuery)
  – Back-end (C++)
  – Cross-language communication
• Scalable solution, easy to update:
  – The web server (front-end) and the plagiarism detection modules (back-end) may run on different machines
  – Plagiarism detection can be distributed across different machines (distributed workers)
• Several external open-source libraries are used (e.g. Apache Tika, CLucene, etc.)
System architecture
System architecture
• Example: sequence of steps for processing PDF files
  – Apache Tika is used to transform PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution
Detection of plagiarism
• Two different problems:
  – Intrinsic plagiarism (analyze only the suspicious document)
  – External plagiarism (also has a reference collection to check against)
    • How large is the collection? Online sources?
    • Source identification
    • Text alignment
Detection of plagiarism
Steps for external plagiarism detection
1. Candidate selection
   – Find pairs of suspicious texts
   – Combines source identification with text alignment
2. Detailed analysis
3. Post-processing
Algorithms for candidate selection
• Selection of the plausible pairs of plagiarism
• Uses stop-word elimination, tf-idf weighting & cosine similarity
• Initial hypothesis: treat it as a "Similarity Search Problem" – All-Pairs, ppjoin (Prefix Filtering with Positional Information Join)
Algorithms for candidate selection
• FastDocode (presented at PAN 2010) + caching + sub-linear merging
• New approach:
  – Text segments => fingerprints & indexing with Apache CLucene
  – Compute the number of inversions
| N-gram length | Segment size | Retention rate | TP | FP | FN | Time (h) | Plagdet |
|---|---|---|---|---|---|---|---|
| 3 | 150 | 10% | 5413 | 44522 | 11469 | ~1 | 0.162 |
| 4 | 150 | 10% | 4913 | 10297 | 11969 | ~2 | 0.306 |
| 4 | 150 | 30% | 7633 | 35169 | 9249 | ~4.5 | 0.256 |
| 5 | 150 | 20% | 5194 | 6256 | 11688 | ~3 | 0.367 |
Comparison of methods (on 1000 documents):

| Method | TP | FP | FN | Prec. | Recall | Plagdet |
|---|---|---|---|---|---|---|
| Fingerprinting & indexing | 685 | 494 | 761 | 0.581 | 0.474 | 0.522 |
| FastDocode#3 | 634 | 4097 | 812 | 0.134 | 0.438 | 0.205 |
| FastDocode#4 | 424 | 815 | 1022 | 0.342 | 0.293 | 0.316 |
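The fingerprinting & indexing idea can be sketched as follows. This is a simplified stand-in: MD5 hashes of word 4-grams and a plain Python dict instead of the actual CLucene index, with the segment texts chosen for illustration. Full retention is used here so the tiny example behaves deterministically; the runs in the tables keep only 10-30% of the fingerprints per segment:

```python
import hashlib
from collections import defaultdict

def fingerprints(text, n=4, retention=1.0):
    """Hash every word n-gram of a segment and keep only the smallest
    `retention` fraction of the hashes (a min-hash-style retention rate)."""
    words = text.lower().split()
    hashes = sorted(
        int(hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest(), 16)
        for i in range(len(words) - n + 1)
    )
    k = max(1, int(len(hashes) * retention)) if hashes else 0
    return set(hashes[:k])

# Inverted index: fingerprint hash -> ids of source segments containing it
sources = {
    "src1": "the canny edge detector is susceptible to noise present in raw unprocessed image data",
    "src2": "dynamic programming solves the longest common subsequence problem efficiently",
}
index = defaultdict(set)
for sid, text in sources.items():
    for h in fingerprints(text):
        index[h].add(sid)

# A suspicious segment votes for every source it shares fingerprints with;
# sources with many votes become candidate pairs for detailed analysis
suspicious = "the canny edge detector is susceptible to noise present on raw image data"
candidates = defaultdict(int)
for h in fingerprints(suspicious):
    for sid in index[h]:
        candidates[sid] += 1
print(dict(candidates))   # -> {'src1': 6}: six shared 4-grams
```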
Algorithms for detailed analysis
• DotPlot: a "Sequence Alignment Problem"
• Modified FastDocode:
  – Extending the analysis to the right and to the left, starting from common words/passages
  – Using passages instead of words as seeds for the comparison
  – tf-idf weighting & cosine similarity
Image source: Wikipedia
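The seed-and-extend idea behind this step can be sketched as below. This simplification seeds on single shared words and extends only over exact word matches; the system described above uses passages as seeds and tf-idf/cosine similarity rather than exact equality:

```python
def extend_match(susp, src, i, j):
    """Extend a matching seed at susp[i] == src[j] left and right while
    the surrounding words keep matching; return the aligned spans."""
    lo_i, lo_j = i, j
    while lo_i > 0 and lo_j > 0 and susp[lo_i - 1] == src[lo_j - 1]:
        lo_i, lo_j = lo_i - 1, lo_j - 1
    hi_i, hi_j = i, j
    while (hi_i + 1 < len(susp) and hi_j + 1 < len(src)
           and susp[hi_i + 1] == src[hi_j + 1]):
        hi_i, hi_j = hi_i + 1, hi_j + 1
    return (lo_i, hi_i), (lo_j, hi_j)

def seed_and_extend(susp_text, src_text, min_len=5):
    """Find maximal common word runs, seeded at every shared word."""
    susp, src = susp_text.lower().split(), src_text.lower().split()
    positions = {}
    for j, w in enumerate(src):
        positions.setdefault(w, []).append(j)
    spans = set()
    for i, w in enumerate(susp):
        for j in positions.get(w, []):
            (a, b), (c, d) = extend_match(susp, src, i, j)
            if b - a + 1 >= min_len:
                spans.add((a, b, c, d))
    return sorted(spans)

susp = "the result is a slightly blurred version of the original image"
src = "the result is a slightly blurred version of the original text"
print(seed_and_extend(susp, src))   # -> [(0, 9, 0, 9)]: the shared 10-word run
```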
Algorithms for post-processing
• Semantic analysis using LSA
  – Built a semantic space with papers from Computer Science (and pages from Wikipedia)
  – Gensim framework in Python
• Smith-Waterman algorithm
  – Dynamic programming
  – Similar to the longest common subsequence
  – Insert and delete operations may have any cost (they may be greater than 1)
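A minimal Smith-Waterman scoring sketch over word tokens follows; unlike LCS, the gap (insert/delete) and mismatch costs are free parameters. The token sequences and score values here are illustrative, not the ones used by the system:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two token sequences.
    Gap costs are configurable and may exceed 1 in magnitude."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores are floored at 0, so a bad region
            # never drags down a later, independent match
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

x = "the raw image is convolved with a gaussian filter".split()
y = "the image is convolved with a gaussian kernel".split()
print(smith_waterman(x, y))   # -> 12: six matching words at +2 each
```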
Results
• Corpus: PAN 2011 (~22k documents)
• Run time on a laptop: ~20 hours
• Results – official results from PAN 2011:

| Plagdet | Recall | Precision | Granularity |
|---|---|---|---|
| 0.221929185084 | 0.202996955425 | 0.366482242839 | 1.26150173611 |
Results
• Specific corpus for CS:
  – 940 BSc theses + 8700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
  – 307 BSc theses in English
• Example of a detected pair:
  – Plagiarized text: "The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree."
  – Original text from Wikipedia: "Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree."
• Some elements are incorrectly identified as plagiarism: quotes, bibliographic references
Conclusions
• Improve the corpus
• The system uses several parameters that were determined empirically => use machine learning to find the best values
• Increase the processing speed
• Improve the method: "bag of words" + information about the positions of the words
• Better post-processing is needed for real documents (such as scientific papers or theses)
Thank you!
• Questions?
• Discussion