Dr. Rao Muhammad Adeel Nawab
How to Read a Research Paper?
Session V
Making Summary and Documenting a Paper
How to Work
إياك نعبد وإياك نستعين
You alone we worship, and You alone we ask for help.
Make this dua daily:
ٱهدنا ٱلصراط ٱلمستقيم صراط ٱلذين أنعمت عليهم
Guide us to the straight path, the path of those You have blessed.
Power of Dua
The Messenger of Allah (peace be upon him) said:
إنما الأعمال بالنية
Actions are judged by intentions.
Learning and teaching with the intention of serving Allah's creation is worship for us. Study and teach to seek the pleasure of Allah.
Power of Neyat
Dua - Take Help from Allah before starting any Task
Balanced Life is Ideal Life
Get Excellence in five things
A Journey from BEGINNER to EXCELLENCE
You must have a combination of five things with different variations. However, the aggregate will be the same.
Have a DADDU YAR in life to drain out on daily basis
Excellence – FRIENDS
Excellence – FAMILY
Earn the duas of parents and elders through khidmat (service) and adab (respect)
Your wife/husband must be your best friend
Be humble and kind to kids, subordinates and poor people
Outline
Understand Order and Flow
Template-based Approach to Read a Paper
Understand Order and Flow
Order and Flow
Document (or Paper) Level
Connection between Sections
Section Level
Connection between Paragraphs
Paragraph Level
Connection between Sentences
Sentence Level
Connection between Words / Phrases
Paper Outline
Abstract
Introduction
Related Work
Corpus Generation
- Extract sentence/passage pairs
- Annotation guidelines
- Annotations
- Corpus statistics
- Examples from the corpus
- Linguistic analysis of the transformations
Paper Outline (cont.)
Text reuse detection experiments
- Translation plus mono-lingual analysis
- Experimental setup
Results and analysis
Conclusions and future work
Acknowledgements
References
Template-based Approach to Read a Paper
Reading Abstract
Read sentence by sentence and interpret each sentence
Template
- Problem (or Research Problem)
- Importance of Problem
- Application(s) of Problem
- Summary of Existing Literature
- Research Gap
- Proposed Solution
- Characteristics of Proposed Solution
- Results and Main Findings
Abstract
Sentence 1: Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages.
Interpretation: Cross-lingual text reuse detection is a challenging task
Insights
- Research Problem: Cross-lingual text reuse detection
- Importance: It is a widespread problem and also difficult to detect
Abstract
Sentence 2: The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale.
Interpretation: Reason(s) for rise in cross-lingual text reuse
Insights: Justification of why cross-lingual text reuse detection is an important problem to address, why it has generic applications, and why it is on the rise
Abstract
Sentence 3: Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases.
Interpretation: Summary of existing literature and Research Gap
Insights
- Summary of existing literature - Researchers have proposed methods to detect it
- Research gap - Unavailability of large-scale gold standard evaluation resources built on real cases
Abstract
Sentence 4: To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair.
Interpretation: Proposed Solution
Insights: Need to be very specific in proposing a solution
- Cross-language sentence/passage level text reuse corpus
- English-Urdu language pair
Abstract
Sentence 5: The Cross-Language English-Urdu Corpus (CLEU) has source text in English while the derived text is in Urdu.
Interpretation: Brief details of the main contribution, i.e. the proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
Insights: Brief detail of the proposed solution - source text in English while the derived text is in Urdu. Note that this is the selling point of the paper.
Abstract
Sentence 6: It contains in total 3,235 sentence/passage pairs manually tagged into three categories i.e., near copy, paraphrased copy, and independently written.
Interpretation: Main characteristic of the proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
Insights
- Total 3,235 sentence/passage pairs
- Manually tagged
- Three categories, i.e., near copy, paraphrased copy, and independently written
Abstract
Sentence 7: Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness.
Interpretation: Brief details of the secondary contribution and applications of the proposed solution
Insights
- Technique – translation plus mono-lingual analysis
- Experiments – 3
- Evaluation – comparison of various techniques on the same dataset (proposed in this study)
Note that it is not the selling point of this research work.
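A minimal sketch of the translation plus mono-lingual analysis idea: translate the derived text into the language of the source, then compare the two texts with an ordinary mono-lingual similarity measure. Everything named here is an assumption for illustration: `toy_translate` and `TOY_LEXICON` stand in for a real machine translation system (using romanised Urdu words), and Jaccard word overlap stands in for whichever similarity measure is actually used.

```python
from typing import Callable, List


def jaccard(a: List[str], b: List[str]) -> float:
    """Word-set overlap between two token lists (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def translation_plus_monolingual(
    source_en: str,
    derived_ur: str,
    translate: Callable[[str], str],
    similarity: Callable[[List[str], List[str]], float] = jaccard,
) -> float:
    """Step 1: translate the derived text. Step 2: compare in one language."""
    translated = translate(derived_ur)
    return similarity(source_en.lower().split(), translated.lower().split())


# Hypothetical stand-in for a machine translation system (romanised Urdu).
TOY_LEXICON = {"aaj": "today", "barish": "rain", "hui": "fell"}


def toy_translate(text: str) -> str:
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


score = translation_plus_monolingual("rain fell today", "aaj barish hui", toy_translate)
```

Keeping the translator and the similarity measure as parameters mirrors the two-stage nature of the method: either stage can be swapped without touching the other.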
Abstract
Sentence 8: Evaluation results (F1 = 0.732 binary, F1 = 0.552 ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts.
Interpretation: Types of classification, best results and main finding
Insights
- Ternary Classification – Verbatim, Paraphrased and Independently Written
- Binary Classification – Derived vs non-Derived
- Results – F1 = 0.732 (binary), F1 = 0.552 (ternary)
- Main Findings – It is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts.
Abstract
Sentence 9: The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
Interpretation: Strengths and applications of the proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
Abstract: Overall Interpretations
1. Cross-lingual text reuse detection is a challenging task
2. Reasons for rise in cross-lingual text reuse (importance and application)
3. Summary of existing literature and Research Gap
4. Proposed Solution
5. Brief details of the main contribution, i.e. the proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
Abstract: Overall Interpretations
6. Main characteristic of the proposed solution (cross-lingual text reuse corpus for English-Urdu language pair)
7. Brief details of the secondary contribution
8. Types of classification, best results and main finding
9. Strengths and applications of the proposed solution
Reading Introduction
Summarize each paragraph into a single sentence
See the order and flow of paragraphs
Introduction: Passage 1
Text reuse, the process of creating new texts using existing ones, has become very common because of free, readily available, and large digital repositories. In addition, state-of-the-art text processing applications have made it very simple to copy-paste text and give it a new identity. Text borrowed from such sources can be reused verbatim (copy paste) or rewritten (paraphrased). If the rewriting process involves complex editing operations (e.g., lexical substitution, changes in syntax, summarization, synonym replacement, altering word order, or verb or noun nominalization) then the borrowed text transforms into an independently written piece (Clough, Gaizauskas, Piao, & Wilks, 2002; Maurer, Kappe, & Zaka, 2006). Moreover, new text can be created using text from one or more sources and the amount of reused text varies from local text reuse (such as, a single word, small chunks, or sentences) to global text reuse (i.e., an entire document; Mittelbach, Lehmann, Rensing, & Steinmetz, 2010; Seo & Croft, 2008).
Introduction - Passage 1 - Summary
Definition of Text Reuse
Importance of Text Reuse
Levels of Text Reuse
Verbatim
Paraphrased
Independently Written
Types of Text Reuse (Local vs Global)
Introduction: Passage 2
Unlike academic plagiarism (the unacknowledged reuse of text), text reuse is a common practice in journalism. Newspapers pay news agencies for their text(s) (here termed source text) to generate news stories (termed derived text). The text purchased from a news agency can be reused “verbatim” or “paraphrased” to create the newspaper story. However, at times the newspaper story might also be independently written without using any news agency text (Clough, 2010).
Introduction - Passage 2 - Summary
Definition of Plagiarism
Process of text reuse in Journalism
Introduction: Passage 3
Text reuse can either be mono-lingual (when the source and derived text share the same language) or cross-lingual (when the source text is in one language and the derived text is in another). Mono-lingual text reuse detection has been a subject undergoing intense study for the research community for some time, but recently the focus has shifted towards detecting text reuse across languages (Ceska, Toman, & Jezek, 2008; Franco-Salvador, Gupta, Rosso, & Banchs, 2016; Gupta, Barrón-Cedeño, & Rosso, 2012; Potthast, Barrón-Cedeño, Stein, & Rosso, 2011). A recent study suggested that the scale of cross-language text reuse and plagiarism is increasing (Barrón-Cedeño, Gupta, & Rosso, 2013). This is because of the following reasons: (a) users of under-resourced languages, which are very large in number, commonly use text(s) from resource-rich languages, (b) speakers of one language staying in a country other than their own can consult the text(s) in their native language, and (c) often speakers of one language are keen to write in a foreign language. Likewise, the recent rise in multi-linguality, freely available machine translation systems, and intelligent word processors are contributing to an environment where it is easy to reuse text across languages, but with a perception of being harder to detect such reuse (Somers, Gaspari, & Niño, 2006). Therefore, there is an ever-increasing necessity to develop standard evaluation resources and methods to detect cross-language text reuse for the various language pairs.
Introduction - Passage 3 - Summary
Mono-lingual vs Cross-lingual text reuse
Three main reasons for rise in cross-lingual text reuse
Introduction: Passage 4
To develop, evaluate, and analyze methods for cross-language text reuse (either local or global), gold standard benchmark corpora are needed. These corpora can be generated in three ways: (a) artificial - using an automatic text altering tool, (b) simulated - humans are asked to rewrite source text to create new text, and (c) real - news agency’s text is reused by journalists to create the newspaper story. It seems likely that cross-language text reuse detection methods which are trained on real examples are more likely to give realistic performance that we investigate further in our paper.
Introduction - Passage 4 - Summary
Why is it important to develop a cross-lingual text reuse corpus?
Three ways to generate a corpus
artificial
simulated
real
Introduction: Passage 5
This study aims to develop a publicly available large-scale benchmark corpus that contains real examples of cross-language text reuse at sentence/passage level for the English-Urdu language pair. Urdu belongs to the Indo-Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, 2015). Moreover, it has a strong Perso-Arabic influence in its vocabulary and is written in a Perso-Arabic script from right to left. It is also spoken world-wide because of the South Asian Diaspora (with large populations in the Middle East, United States, UK, Norway, and Canada etc.; Daud, Khan, & Che, 2016). Despite that, for the English-Urdu language pair, there are no publicly available cross language text reuse detection datasets known to us. Moreover, previous research has tended to focus more on European languages. The corpus developed as an outcome of this study contains 3,235 pairs of real examples of cross-language text reuse at sentence/passage level (the source text is in English whereas derived text is in Urdu). Each sentence/passage pair is categorised as i) Near Copy (NC; 751 pairs), ii) Paraphrased Copy (PC; 1751 pairs), or iii) Independently Written (IW; 733 pairs). The corpus is representative enough to serve as a benchmark dataset for: (a) developing and evaluating techniques for cross-language text reuse detection for the English-Urdu language pair, (b) obtaining an insight into what edit operations are likely used by journalists in reusing text, and (c) to foster text reuse detection research in the English-Urdu language pair.
Introduction - Passage 5 - Summary
Main aim of this study
Importance of Urdu
Summary of Literature Review
Research Gap
Proposed Solution (or Corpus)
Characteristics and applications of proposed solution
Introduction: Passage 6
The remainder of this article is organized as follows. We first review previously developed cross-lingual text reuse or plagiarism detection corpora. Then we present a detailed discussion on the CLEU corpus construction, its statistics, characteristics, linguistic analysis, and example cases. This is followed by the explanation of cross language text reuse detection experiments that we performed on our corpus to highlight its strengths and its utility for evaluation purposes. Finally, we present the results and their analysis and then conclude the article.
Introduction - Passage 6 - Summary
Organization of Paper
Introduction: Overall Interpretations
1. Definition of Text Reuse, Importance of Text Reuse, Levels of Text Reuse (Verbatim, Paraphrased, Independently Written) and Types of Text Reuse (Local vs Global)
2. Definition of Plagiarism and Process of text reuse in Journalism
3. Mono-lingual vs Cross-lingual text reuse and Three main reasons for rise in cross-lingual text reuse
Introduction: Overall Interpretations
4. Why is it important to develop a cross-lingual text reuse corpus? and Three ways to generate a corpus (artificial, simulated and real)
5. Main aim of this study, Importance of Urdu, Summary of Literature Review, Research Gap, Proposed Corpus (or Solution), its characteristics and applications
6. Organization of Paper
Reading - Related Work
Summarize each paragraph into a single sentence
See the order and flow of paragraphs
Related Work: Passage 1
In the previous literature, efforts have been made to develop standard evaluation resources for measuring cross language text reuse (and plagiarism) for different language pairs. For example, PAN authors have developed a series of corpora with artificial and simulated examples of plagiarism at document level (Potthast, Barrón-Cedeño, Eiselt, Stein, & Rosso, 2010; Potthast, Eiselt, Barrón-Cedeño, Stein, & Rosso, 2011; Potthast et al., 2012–2014; Stein, Rosso, Stamatatos, Koppel, & Agirre, 2009). The majority (90%) of the text plagiarism cases in these corpora are mono-lingual, however, there exists a small portion (10%) of cross-lingual plagiarism cases too. These cross language plagiarism cases are for the English-German and English-Spanish language pairs. Most of these cases are artificial (created using automatic MT [Machine Translation] system that is, Google Translate) but a small number of them are created manually (i.e., translated by humans). These corpora have been used to evaluate text plagiarism detection methods in the competitions held annually.
Related Work: Passage 1 - Summary
Opening sentence
Summary of PAN text reuse corpora
Related Work: Passage 2
The CL!TR (Cross-Language Indian Text Reuse) corpus is the first of its kind developed specifically for the analysis of cross-language text reuse detection in the Hindi-English language pair at document level (Barrón-Cedeño, Rosso, Devi, Clough, & Stevenson, 2013). The suspicious documents it contains are in Hindi and the source documents in English language. The training set includes 198 suspicious (Hindi) and 5,032 source (English) documents, whereas the test set has 190 suspicious (Hindi) and 5,032 source (English) documents. The CL!TR corpus contains simulated cases of text reuse. The volunteers involved in the study were asked to answer a set of 10 questions, related to the tourism and computer science domains, to create suspicious documents. It contains three types of revisions, categorized by the amount of obfuscation used, namely “Exact” (without any modifications, translation only), “Light” (very few modifications, translation, and manual correction), and “Heavy” (detailed modifications, translation, and manual correction). The corpus also contains “Original” (independently written) documents which were generated without referring to the source documents but using the learning material provided.
Related Work: Passage 2 - Summary
English-Hindi CL!TR Cross-Lingual Text Reuse Corpus
Related Work: Passage 3
Another cross-language corpus of 110 documents (55 source in English and 55 plagiarized in Bangla) that contains simulated plagiarism cases and was built using student’s reports from a university (Arefin, Morimoto, & Sharif, 2013). Two groups of 55 students each, were asked to write a report on a given topic. 50 reports are used as training set whereas the remaining 10 as test set. Plagiarism cases were obfuscated by replacing contents with several plagiarized fragments of different lengths. However, the corpus is not available to download.
Related Work: Passage 3 - Summary
English-Bangla Cross-Lingual Text Reuse Corpus
Related Work: Passage 4
Recently, a cross-language (Urdu-English language pair) document level plagiarism detection corpus was submitted for the PAN 2016 shared task (Hanif et al., 2015). The corpus is divided in two sets, 500 source (Urdu) and 500 suspicious (English) documents, and contains only simulated examples of plagiarism. The source documents are Wikipedia excerpts whereas the plagiarized documents were manually created by university students. The students were asked to plagiarize 270 documents on three levels of obfuscation (“Near Copy,” “Light Revision,” and “Heavy Revision”), whereas 230 documents in the corpus are “Nonplagiarized.” Moreover, the plagiarism cases inserted in the suspicious documents are of various length that is small (< 50 tokens), medium (50–100 tokens), and large (100–200 tokens). The corpus is the first cross language (Urdu-English pair) dataset created for plagiarism detection research at the document level.
Related Work: Passage 4 - Summary
English-Urdu CLUE Cross-Lingual Text Reuse Corpus
Related Work: Passage 5
CLiPA (Cross-Language Plagiarism Analysis) is a publicly available fragment or sentence level corpus containing five source sentences (in English) which were used to generate plagiarized cases (in Spanish and Italian) using both machine translation (artificial) and manual translation (simulated; Barrón-Cedeño, Rosso, Pinto, & Juan, 2008). The machine translation cases were generated using five different services to have variations whereas for manually (human) simulated plagiarism cases, nine volunteers were asked to plagiarize each of the five source fragments. They were further requested to generate the same number of nonplagiarized cases as well. The corpus was used in experiments on text plagiarism detection research in the English-Spanish and English-Italian language pairs.
Related Work: Passage 5 - Summary
English-Spanish and English-Italian CLiPA Cross-Lingual Text Reuse Corpus
Related Work: Passage 6
In summary, the corpora discussed above either contain artificial or simulated examples of cross-language text reuse (or plagiarism). Cross-language text reuse detection methods developed using these non-real types of text reuse are unlikely to perform well on real cases of text reuse that occur in real world scenarios (e.g., academia, journalism; Weber-Wulff, 2010).
Related Work: Passage 6 - Summary
Summary of existing literature and Research Gap (unavailability of real examples of text reuse).
Related Work: Passage 7
Moreover, the simulated cases created in a controlled environment using crowd-sourcing do not represent the strategies used by humans when rewriting text in real life. Because cross-language text reuse is increasing day by day, first, there is an urgent need to develop text reuse detection corpora with real examples of text reuse. Second, the available corpora for research are created at document level and there are no corpora available at sentence/passage level for the English-Urdu language pair. Last, the corpora listed above are not large enough to generate robust results. This is not surprising because it takes a lot of manual effort to create corpora with simulated examples of text reuse or plagiarism.
Related Work: Passage 7 - Summary
Limitations of existing work (or corpora)
Related Work: Passage 8
To develop and evaluate cross-language text reuse detection methods for the real-world scenario, we need to create corpora with real examples of text reuse. To fill this gap, our research work proposes a large-scale gold standard benchmark corpus containing real examples to measure cross-language text reuse at sentence/passage level for the English-Urdu language pair. The next section describes the corpus generation process in detail.
Related Work: Passage 8 - Summary
Justification for the need of a new corpus, our contribution in developing a new corpus and connection with the next section
Related Work: Overall Interpretations
1. Opening sentence and Summary of PAN text reuse corpora
2. English-Hindi CL!TR Cross-Lingual Text Reuse Corpus
3. English-Bangla Cross-Lingual Text Reuse Corpus
4. English-Urdu CLUE Cross-Lingual Text Reuse Corpus
Related Work: Overall Interpretations
5. English-Spanish and English-Italian CLiPA Cross-Lingual Text Reuse Corpus
6. Summary of existing literature and Research Gap (unavailability of real examples of text reuse).
7. Limitations of existing work (or corpora)
8. Justification for the need of a new corpus, our contribution in developing a new corpus and connection with the next section
Reading - Corpus Generation
Purpose of Corpus
- Cross-lingual text reuse detection
Corpus Generation Process
- Extracting sentence/passage pairs
- Preparation of annotation guidelines
- Annotation of text by three annotators
- Computing inter-annotator agreement
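The last step, computing inter-annotator agreement, can be illustrated with Cohen's kappa for one pair of annotators. This is only a sketch: the slides do not say which agreement measure was used, so kappa is an assumption here, and the label sequences are made up. With three annotators, one option is to compute it for each pair of annotators.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up annotations using the three categories named in the abstract.
annotator_1 = ["NC", "PC", "PC", "IW", "PC", "IW"]
annotator_2 = ["NC", "PC", "IW", "IW", "PC", "PC"]
kappa = cohen_kappa(annotator_1, annotator_2)
```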
Corpus Characteristics
Language Pair: English-Urdu
Levels of Text Reuse: Verbatim – 751, Paraphrased – 1751, Independently Written – 733
Standardization: XML format
Global or Local: Local (sentence/passage level)
Size of Corpus: 3,235 pairs in total
Reading – Text Reuse Detection Experiments
Techniques: Translation plus Mono-lingual Analysis
- N-gram Overlap
- Greedy String Tiling
- Longest Common Subsequence
For each technique note 4 things:
- In which category does it fall?
- How does it work?
- What are its strengths and weaknesses?
- In which previous studies has it been used?
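To make "how does it work?" concrete for two of the listed techniques, here is a small sketch of word n-gram overlap and Longest Common Subsequence similarity on already-translated (mono-lingual) text. The function names, the containment formula and the normalisation by the shorter text are illustrative assumptions, not necessarily the exact variants used in the paper.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(source_tokens, derived_tokens, n=2):
    """Share of the derived text's n-grams that also occur in the source."""
    derived = word_ngrams(derived_tokens, n)
    if not derived:
        return 0.0
    return len(derived & word_ngrams(source_tokens, n)) / len(derived)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length normalised by the length of the shorter text."""
    shortest = min(len(a), len(b))
    return lcs_length(a, b) / shortest if shortest else 0.0


source = "the prime minister visited the city on monday".split()
derived = "the prime minister went to the city on monday".split()
overlap = ngram_containment(source, derived, n=2)  # 5 of 8 derived bigrams are shared
lcs = lcs_similarity(source, derived)              # LCS of 7 tokens, shorter text has 8
```

N-gram overlap rewards contiguous copied chunks, while LCS tolerates insertions between matching words, which is why the two scores differ on the same pair.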
Evaluation Methodology
Supervised text classification task
Binary classification: derived (verbatim + paraphrased) vs non-derived (independently written)
Ternary classification: verbatim vs paraphrased vs independently written
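The relation between the two classification tasks can be sketched as a simple label mapping: the ternary labels are collapsed into derived vs non-derived. The label codes (NC, PC, IW) and category counts are taken from the paper's Introduction; the dictionary and function names are illustrative.

```python
from collections import Counter

# Ternary label -> binary label (near copy and paraphrased copy both count as derived).
TERNARY_TO_BINARY = {"NC": "derived", "PC": "derived", "IW": "non-derived"}


def to_binary(ternary_labels):
    """Collapse ternary text reuse labels into the binary task's labels."""
    return [TERNARY_TO_BINARY[label] for label in ternary_labels]


# Category sizes reported for the corpus: 751 NC, 1751 PC, 733 IW (3,235 pairs).
ternary = ["NC"] * 751 + ["PC"] * 1751 + ["IW"] * 733
binary_counts = Counter(to_binary(ternary))
```

The mapping also shows that the binary task is imbalanced: 2,502 derived pairs against 733 non-derived pairs.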
Evaluation Methodology
Evaluation Measures: Precision, Recall, F1
Machine Learning algorithms: J48, Random Forest, SMO
Machine Learning Toolkit: WEKA
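As a reminder of what the three evaluation measures compute, here is a minimal sketch for one class treated as positive. The gold and predicted labels are made up for illustration; this is not WEKA's implementation, and how the paper averages scores across classes is not stated in the slides.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # correctly flagged
    fp = sum(g != positive and p == positive for g, p in pairs)  # wrongly flagged
    fn = sum(g == positive and p != positive for g, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["derived", "derived", "derived", "non-derived", "non-derived"]
predicted = ["derived", "derived", "non-derived", "derived", "non-derived"]
precision, recall, f1 = precision_recall_f1(gold, predicted, positive="derived")
```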
Reading – Results and Analysis
Explain “Terms” in the Table.
Explain “Overall” best results
Explain results with individual techniques
Conclude your results
proposed approach outperforms baseline approach
Summarize and Document Paper in Tabular Format
Sr no.: 1
Year: 2018
Paper Title: CLEU - A Cross-Language English-Urdu Corpus and Benchmark For Text Reuse Experiments
Authors: Iqra Muneer, Muhammad Sharjeel, Muntaha Iqbal, Rao M. Adeel Nawab, Paul Rayson
Conference / Journal: Journal of the Association for Information Science and Technology (JASIST)
Publisher: John Wiley & Sons
Problem: Cross-Lingual Text Reuse Detection
Importance of Problem: The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale.
Applications of Problem: 1. Cross-lingual plagiarism detection 2. Duplicate content removal from Web
Summary of Literature Review: Cross-lingual text reuse detection corpora have been developed for various language pairs including English-Urdu, English-Hindi, and English-Spanish.
Research Gap: One major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases for cross-lingual text reuse detection, particularly for the English-Urdu language pair
Corpus / Dataset
Proposed Solution: A cross-language sentence/passage level text reuse corpus for the English-Urdu language pair
Purpose of Corpus: Develop systems to detect cross-lingual text reuse for the English-Urdu language pair
Corpus Generation Process: 1. Data collection from news articles 2. Related pairs extraction 3. Annotation guidelines
Corpus / Dataset
Corpus Characteristics
- Number of documents: 900
- Levels of text reuse: 1. Exact Copy 2. Paraphrase Copy 3. Independently Written
- Language: English – Urdu
- License: Creative (Open access), publicly available
Technique: Translation + Mono-lingual Analysis with 1. Longest Common Subsequence 2. N-gram Overlap 3. Greedy String Tiling
Toolkit: Weka
Evaluation Measures: 1. Precision 2. Recall 3. F1-measure
Evaluation Methodology: 1. Supervised document classification task 2. Ten-fold cross-validation
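Greedy String Tiling is the least self-explanatory of the three similarity techniques, so a simplified sketch may help when noting how it works. It repeatedly finds the longest runs of tokens shared by both texts that are not yet covered and marks them as tiles. This is a plain quadratic version written for readability; the parameter name `min_match_len` and the normalisation in `gst_similarity` are assumptions, not the paper's exact settings.

```python
def greedy_string_tiling(a, b, min_match_len=2):
    """Return non-overlapping common token runs as (start_a, start_b, length)."""
    if min_match_len < 1:
        raise ValueError("min_match_len must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        longest = min_match_len
        matches = []
        # Find all maximal unmarked matches of the current longest length.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > longest:
                    matches, longest = [(i, j, k)], k
                elif k == longest:
                    matches.append((i, j, k))
        if not matches:
            break
        # Turn matches into tiles, skipping any that overlap a tile just placed.
        for i, j, k in matches:
            if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                for offset in range(k):
                    marked_a[i + offset] = True
                    marked_b[j + offset] = True
                tiles.append((i, j, k))
    return tiles


def gst_similarity(a, b, min_match_len=2):
    """Tokens covered by tiles, normalised by the combined length of both texts."""
    if not a or not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match_len))
    return 2 * covered / (len(a) + len(b))


text_a = "the cat sat on the mat".split()
text_b = "on the mat the cat sat".split()
tiles = greedy_string_tiling(text_a, text_b)
similarity = gst_similarity(text_a, text_b)
```

Unlike Longest Common Subsequence, tiling is insensitive to block reordering: the two example texts swap their halves yet are fully covered by two tiles.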
Classifiers: 1. Random Forest 2. Naive Bayes 3. J48 4. SMO
Results (binary classification): F1 = 0.735 using GST-mml1
Results (ternary classification): F1 = 0.549 using GST-mml1
Main Finding(s): It is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts.
Future Work: Improve results by developing a new technique / algorithm.
Any Remarks: -
Source Code URL: Not available publicly
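If the tabular summary is kept for many papers, it can be convenient to store each paper as one row of a CSV file. The sketch below is one possible way to do that with Python's standard `csv` module; the column list is a subset of the headings used in the table above, and the helper name `papers_to_csv` is made up for illustration.

```python
import csv
import io

# A subset of the column headings from the tabular summary.
FIELDS = [
    "Sr no.", "Year", "Paper Title", "Conference / Journal", "Publisher",
    "Problem", "Research Gap", "Proposed Solution", "Toolkit", "Main Finding(s)",
]


def papers_to_csv(rows):
    """Serialise paper summaries (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)  # missing fields become empty cells
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


summary = {
    "Sr no.": 1,
    "Year": 2018,
    "Paper Title": "CLEU - A Cross-Language English-Urdu Corpus and Benchmark "
                   "For Text Reuse Experiments",
    "Conference / Journal": "JASIST",
    "Publisher": "John Wiley & Sons",
    "Problem": "Cross-Lingual Text Reuse Detection",
    "Toolkit": "Weka",
}
csv_text = papers_to_csv([summary])
```

One row per paper makes it easy to sort and filter a literature review by year, problem or research gap in any spreadsheet tool.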
Physical Health
Mental Health
Social Health
Key to Success
7-9 hours sleep per night
3 healthy meals daily
30 minutes brisk walk or running or exercise
Offer 5 Namaaz daily with Jamaat
Help at least one person daily for the pleasure of Allah (Allah ki raza)
Practice Six Things on Daily Basis to Become a Great Human Being (Insha Allah)
Recite Durood Sharif daily (Min: 100 – Max: 125K)
BECOME A VOLUNTEER
MAKE A DIFFERENCE