
APPROVED:

Jeff Allen, Major Professor
Kim Nimon, Co-Major Professor
Lin Lin, Committee Member
Cathy Norris, Chair of the Department of Learning Technology
Kinshuk, Dean of the College of Information
Victor Prybutok, Vice Provost of the Toulouse Graduate School

A CONTENT ORIGINALITY ANALYSIS OF HRD FOCUSED DISSERTATIONS

AND PUBLISHED ACADEMIC ARTICLES USING Turnitin

PLAGIARISM DETECTION SOFTWARE

Robin James Mayes, B.S., M.S., M.S.

Dissertation Prepared for the Degree of

DOCTOR OF PHILOSOPHY

UNIVERSITY OF NORTH TEXAS

May 2017

Mayes, Robin James. A Content Originality Analysis of HRD Focused

Dissertations and Published Academic Articles using Turnitin Plagiarism Detection

Software. Doctor of Philosophy (Applied Technology and Performance Improvement),

May 2017, 167 pp., 15 tables, 19 figures, references, 167 titles.

This empirical exploratory study quantitatively analyzed content similarity indices

(potential plagiarism) from a corpus consisting of 360 dissertations and 360 published

articles. The population was defined using the filtering search criteria human resource

development, training and development, organizational development, career

development, or HRD. This study described in detail the process of collecting content

similarity analysis (CSA) metadata using Turnitin software (www.turnitin.com). This

researcher conducted robust descriptive statistics, a Wilcoxon signed-rank statistic

between the similarity indices before and after false positives were excluded, and a

multinomial logistic regression analysis to predict levels of plagiarism for the

dissertations and the published articles. The corpus of dissertations had an adjusted

rate of document similarity (potential plagiarism) of M = 9%, (SD = 6%) with 88.1% of

the dissertations in the low level of plagiarism, 9.7% in the high and 2.2% in the

excessive group. The corpus of published articles had an adjusted rate of document

similarity (potential plagiarism) of M = 11%, (SD = 10%) with 79.2% of the published

articles in the low level of plagiarism, 12.8% in the high and 8.1% in the excessive

group. Most of the difference between the dissertations and published articles was

attributed to plagiarism-of-self issues which were absent in the dissertations. Statistics

were also conducted which returned a statistically significant justification for employing

the investigative process of removing false positives, thereby adjusting the Turnitin

results. This study also found two independent variables (reference and word counts)

that predicted dissertation membership in the high (.15-.24) and excessive level (.25-

1.00) of plagiarism and published article membership in the excessive level (.25-1.00) of

plagiarism. I used multinomial logistic regression to establish the optimal prediction

model. The multinomial logistic regression results for the dissertations returned a

Nagelkerke pseudo R2 of .169 and for the published articles a Nagelkerke pseudo R2

Robin James Mayes

ACKNOWLEDGEMENTS

I would like to thank all of the professors, instructors, and administrators who

have personally taken an interest in my continuing education at the University of North

Texas. In particular, I would like to thank my Dissertation Committee. Professor Kim

Nimon has been instrumental in helping me focus on this topic, providing the editorial

and statistical support needed for the complexities required in corpus analysis. From

her, I gained valuable understandings of how to serve my future students in their

dissertation quest. Professor Jeff Allen has guided me with calming advice about the

difficulties that Ph.D. students encounter. Professor Lin has shared many service

opportunities with me, enabling me to learn about their importance to the University.

Moreover, I would like to acknowledge Professors Kinshuk, Cathy Norris, Mike

Spector, Mickey Wirschenski, Jerry Wirschenski, and John Turner for their ongoing

support, including sound advice for research, teaching, and service opportunities.

I would like to thank my fiancée, Pamela McCleary, my daughters Polly

Pinneaux and Katie Minder, and my extended families (blood and in-laws) for their

patience. I often pushed aside personal obligations and activities while I focused on my

educational demands. I cannot say enough about the colleagues, friends, and family,

with whom I could share my experiences and perceived stresses. Their sacrifices are

hard to identify, but I appreciate that they enjoyed or at least tolerated this prolonged

professional and personal adventure.

TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS .......... iii
LIST OF TABLES .......... vi
LIST OF FIGURES .......... vii
INTRODUCTION .......... 1
    Purpose and Rationale of Study .......... 4
    Research Questions .......... 5
    Delimitations .......... 10
    Limitations .......... 12
LITERATURE REVIEW .......... 15
    Plagiarism Definition, Categories, and Types .......... 15
    Factors Contributing to Plagiarism .......... 24
    Consequences of Plagiarism .......... 27
    Empirical Research .......... 30
    Reviewing Plagiarism Software .......... 43
METHODOLOGY .......... 47
    Research Design Overview .......... 47
    Population/Sample .......... 50
    Data Collection Process 1 .......... 54
    Data Collection Process 2 .......... 56
    Data Export Process .......... 63
    Data Collection Summary .......... 65
    Data Analysis .......... 66
RESULTS .......... 73
    Dissertation Descriptive Statistics Results .......... 73
    Dissertation Differential Statistics Results .......... 78
    Dissertation Predictive Statistics Results .......... 82
    Published Article Descriptive Statistics Results .......... 88
    Published Article Differential Statistics Results .......... 93
    Published Article Predictive Statistics Results .......... 96
DISCUSSION .......... 103
    Discuss and Synthesize Research Findings .......... 103
    Descriptive Findings .......... 104
    Differential Findings .......... 111
    Prediction Findings .......... 114
    Discuss Document Similarity Levels .......... 117
    Issues and Obstacles .......... 121
    Conclusions .......... 125
    Implications .......... 128
    Future Research .......... 134
APPENDIX A: DISSERTATION FREQUENCY DETAIL TABLES .......... 137
APPENDIX B: PUBLISHED ARTICLES FREQUENCY DETAIL TABLES .......... 140
APPENDIX C: SPSS AND R SYNTAX .......... 142
APPENDIX D: CSV FILES FIELD DESCRIPTIONS .......... 144
APPENDIX E: TURNITIN COA REPORT (NO ADJUSTMENTS) .......... 147
REFERENCES .......... 148

LIST OF TABLES

1. Top Five Reported Retractions Rates for Plagiarism ............................................ 4

2. Comparison of Corpus Plagiarism Study Evaluation Levels ................................. 7

3. Variables Used for the Corpus Analysis in RQ1, RQ2 & RQ3 .............................. 8

4. Corpus-wide Descriptive Statistics for Dissertations ........................................... 73

5. Corpus-wide Spearman's rho Statistics for Dissertations ................................... 76

6. Descriptive Statistics for aDSI & aDSW for Dissertations by Groups ................. 77

7. Modeling MLR Analysis of Sampled Dissertations ............................................. 85

8. MLR Analysis of Sampled Dissertations ............................................................. 87

9. Corpus-wide Descriptive Statistics for Published Articles ................................... 89

10. Corpus-wide Spearman's rho Statistics for Published Articles ........................... 92

11. Descriptive Statistics for aDSI and aDSW for Published Articles........................ 93

12. Modeling MLR Analysis of Sampled Published Articles ...................................... 99

13. MLR Analysis of Sampled Published Articles ................................................... 101

14. Dissertation Corpus Plagiarism Studies Using Turnitin .................................... 107

15. Published Article Corpus Plagiarism Studies Using Turnitin ............................. 109

LIST OF FIGURES

1. Plagiarism category and type map ..................................................................... 16

2. Corpus plagiarism study process chart ............................................................... 48

3. Turnitin SSI interfaces for selecting SSI for examination and exclusion ............. 50

4. Random sampler software .................................................................................. 53

5. Adobe Acrobat JavaScript .................................................................................. 56

6. An example of a dissertation template similarity ................................................. 57

7. Flow diagram for verification of Turnitin content originality report ....................... 62

8. SRT Turnitin COA report data collection interface .............................................. 63

9. Spreadsheet formulas for double-checking SRT-DSI final adjustments ............. 64

10. Dissertation DSI & aDSI frequencies before and after adjustments ................... 79

11. Histogram exhibiting difference between dissertation paired DSI & aDSI .......... 80

12. Dissertation membership in document similarity levels ...................................... 82

13. MS Word VBA script for building all possible string subsets ............................... 84

14. Dissertation scatter plots for word count and reference count variables ............. 87

15. Article DSI & aDSI frequencies before and after adjustments ............................ 94

16. Histogram exhibiting difference between published article paired DSI & aDSI ... 95

17. Published article membership in document similarity levels ............................... 96

18. Published article scatter plots for word count and reference count variables ... 102

19. Trivial similarities of 6 words using a 10 word exemption ................................. 121

INTRODUCTION

Once Johannes Gutenberg invented the printing press in 1440, the stage was set

for a massive increase in the abuse of intellectual property rights, including plagiaristic

activities (Shelley, 2005). Moreover, the parallelisms of mass duplication and broader

distribution opportunities between the Gutenberg press and the global embracing of the

Internet further exacerbated the plagiarism problem (Chao, Wilhelm & Neureuther,

2009). Kock (1999) observed that the digitization and availability of documents for

global audiences had encouraged plagiaristic activities on a scale never experienced

before.

Cheung and Driver (2004) reaffirmed the existence of plagiarism and examined

its unintended consequences. They cautioned about plagiarizing existing knowledge

and advised, “Though researchers may examine the same topics from many angles and

in many populations, the scientific process is hindered when inquiries provide no new

contribution” (p. 7).

While academia promotes and expects quality publishing among its echelon of

researchers and professors (Kock, 1999), public universities facing reductions in state

and federally funded budgets have focused on increases in grant-funded research and

publishing (Shaw, 2002). Kock (1999) postulated that it is commonplace for universities

to reward young researchers or professors with promotions, pay increases, and offers of

tenure, greatly influenced by high publication counts. Moreover, academic publishing is

experiencing more student contributions. Hatch and Skipper (2016) examined 500

curriculum vitae from social science Ph.D. students and found that they had “averaged

4.3 peer-reviewed articles or book chapters before graduation” (p. 171). Shaw (2002),

O’Connor (2010), and Callahan (2014) maintained that these pressures to publish have

had their downside. They argued that increases in publishing pressures have diminished

the quality and importance of research and publishing activities. Baker (2015) affirmed

there were publishing quality problems when they reported that almost 50% of papers

they studied reported inaccurate statistical significance (p) values, some of which

directly affected the results of the published papers.

Karabag and Berggren (2012) further suggested that by reviewing the number of

retractions issued by publishers and authors, one could gauge the level of plagiaristic

activities. They reasoned that while institutions tended to shroud accusations and

resolutions of constituent plagiarism, publishers must publicly retract manuscripts

when legitimate plagiarism issues were brought to their attention. In support of

Karabag and Berggren's assertion of plagiarism and academic dishonesty,

Cabral-Cardoso (2004) discussed a bitter internal political struggle at a business school that

attempted to resolve an accusation of plagiarism. He reported that:

In the case reported here, the informal rule appeared to be: “better keep things

quiet and out of public eyes” saving the university the embarrassment of having

to try to revoke an awarded degree and challenge some senior faculty. (p. 85)

Nevertheless, placing publication issues squarely in the public eye, SAGE

Publications retracted 60 articles from the Journal of Vibration and Control (Retraction

Watch, 2014). The retraction notice stated that SAGE found a peer review ring created

by a single person who created multiple aliases using different email and SAGE user

accounts. While the notice did not list plagiarism as the driving force for the retractions,

there were serious fraudulent authorship issues that led to duplicate publication

submissions.

In another attempt to identify trends in article retractions, Wager and Williams

(2011) reported that the Medline medical literature had experienced a tenfold increase

in retractions over a ten-year period starting in 1999. Understanding that not all

retractions were plagiarism related, Decullier, Huot, Samson, and Maisonneuve (2013)

analyzed 235 retraction notices. They determined that while 28% of the articles were

retracted for mistakes, 20% were plagiarism related followed by fraud at 14%.

Amos (2014) studied retractions from the international biomedical field literature.

She searched for retractions that resulted from plagiarism and duplicate publication

issues. Based on the findings from her exploratory study, she reported 20 national

affiliations ordered by total retraction counts for the period between 2008

and 2012. China, the United States, and India, all global economic powerhouses, held

the top three places in her study. Amos (2014) concluded that:

Exploring plagiarism and duplicate publication across countries contributes to

understanding publishing and retraction practices. Only a very small percentage

of the published literature is ever retracted, and an even smaller percentage of

that literature is retracted because of plagiarism or duplicate publication …

However, these two reasons combined accounted for nearly 35% of all

retractions in the studied sample. (p. 89)

Table 1 displays the top five national affiliation rankings of the 20 national affiliations

identified by Amos.

Purpose and Rationale of Study

The purpose of this study was to identify potentially plagiaristic activities by

examining corpora of dissertations and published articles focused on human resource

development (HRD) using Turnitin content similarity analysis software. Swanson (1995)

defined HRD as “a process of developing and unleashing human expertise through

organization development and personnel training and development for the purpose of

improving performance” (p. 208).

Turnitin produces content originality reports or, more descriptively, content

similarity reports. For the most part, these two terms are interchangeable. While examining

content similarity indices and document descriptive data, this empirical study follows

techniques and analytics similar to those of previously published empirical studies (e.g., Honig &

Bedi, 2012; Ison, 2012; Ison, 2014; Sun, 2013; Thomas & de Bruin, 2015).

Additionally, there is precedent for studying dissertations and published articles

for plagiarism. Ison (2012, 2014) examined plagiarism in dissertations in two related

studies. Honig and Bedi (2012), Sun (2013), Thomas and de Bruin (2014), and others

have examined plagiarism in published articles.

Table 1

Top Five Reported Retractions Rates for Plagiarism and Duplicate Publication

National Affiliation   Plagiarism   Duplicate Publishing   Totals
China                  24           42                     66
United States          17           26                     43
India                  18           7                      25
Italy                  16           2                      18
Japan                  2            13                     15

The rationale for this HRD-focused corpora plagiarism study parallels with what

Honig and Bedi (2012) reported as their underlying reasons for their corpus plagiarism

study in the discipline of management and administration:

We strongly believe this study shows the need to identify and verify the originality

of scholarship and should be an increasingly important responsibility of the

Academy of Management. It is our hope is that this study leads to the

development and implementation of specific screening systems, as well as more

repetitive and transparent ethical guidelines, in order to enhance the scholarship

standards represented by the Academy of Management. (p. 118)

Moreover, this study adds to the literature by identifying and describing the

process of investigating plagiarism, using Turnitin on existing documents, and

answering the following research questions.

Research Questions

For this exploratory quantitative study, I considered three research questions for

each corpus (dissertations and published articles) based upon Turnitin content similarity

results and document metadata. Turnitin defines content similarity results as data

collected from a document using a plagiarism detection software system (i.e., Turnitin).

Rouse defined document metadata as “information attached to a text-based file that

may not be visible on the face of the document” (2014, p. 1).

The Turnitin content originality report provided various related content similarity

data. The document similarity index (DSI) and the document similarity word counts

(DSW) were the starting points. The DSI was the amount of the document’s total

content similarity that Turnitin had identified with other documents in its document

collection database before any exclusions or adjustments were applied. The DSW was a

synthesized value derived from the document word count times the DSI (a percentage).

I adjusted the DSI using “qualitative judgments” for excluding “false text-matching

incidences” of source similarity indices (SSI) from the calculations (Sun, 2013, p. 267).

The SSI-substantive types referred to in the research questions are the SSI equal to or

larger than 5%. I recorded the adjusted DSI as the aDSI. The adjusted document

similarity word (aDSW) count was a synthesized value derived from the document word

count times the resulting aDSI (a percentage).
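
The two synthesized word counts described above reduce to one multiplication each. The following is a minimal sketch of that arithmetic, not the procedure used in the study's SRT database. The function name `synthesized_word_count`, the rounding to whole words, and the example document are all assumptions made for illustration.

```python
def synthesized_word_count(word_count: int, similarity_percent: float) -> int:
    """Estimate similar words as the document word count times a similarity index.

    Rounding to the nearest whole word is an assumption; the text only states
    that the value is the word count times the index (a percentage).
    """
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    if not 0 <= similarity_percent <= 100:
        raise ValueError("similarity_percent must be between 0 and 100")
    return round(word_count * similarity_percent / 100)


# Hypothetical document: 50,000 words, DSI of 14% before adjustment, aDSI of 9% after.
dsw = synthesized_word_count(50_000, 14)   # DSW, from the unadjusted DSI
adsw = synthesized_word_count(50_000, 9)   # aDSW, from the adjusted aDSI
print(dsw, adsw)
```

Passing the index as a whole-number percentage mirrors how Turnitin reports similarity; a proportion (for example .09) would need to be multiplied by 100 first.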

The document metadata collected were document research method (quantitative,

qualitative, mixed, or other), year of publication, author count, word count, and

reference count. As each document was selected, I collected this metadata information

during the document review. I recorded the collected data in the Scholarly Research

Tracker (SRT; Mayes, 2016) database.

The study examined many of the different evaluation strategies and incorporated

various features from all of them (see Table 2). There were several examples of

plagiarism evaluation categories using two- and four-level criteria. This study employed

the labels “Low,” “High,” and “Excessive” which I partially derived from a combination of

Thomas and de Bruin’s (2014) labels. This study also used Thomas and de Bruin’s

(2014) rate levels because they based them upon the aDSI (overlaps and false

positives removed). For the first level, the first two rates were combined: “Low” was 0%-

14%, “High” was 15%-24%, and “Excessive” was 25%-100%.
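
The three-level scheme above can be expressed as a small lookup. This is an illustrative sketch only: the function name `similarity_level` is hypothetical, and because the stated bands use whole percentages, the handling of fractional values (for example 14.5%, which falls to the lower band here) is an assumption.

```python
def similarity_level(adsi_percent: float) -> str:
    """Map an adjusted document similarity index (aDSI, in percent) to a level.

    Bands follow the text: Low 0%-14%, High 15%-24%, Excessive 25%-100%.
    Fractional values between bands fall to the lower band (an assumption).
    """
    if not 0 <= adsi_percent <= 100:
        raise ValueError("adsi_percent must be between 0 and 100")
    if adsi_percent < 15:
        return "Low"
    if adsi_percent < 25:
        return "High"
    return "Excessive"


# Hypothetical aDSI values, one per band.
for value in (9, 15, 25):
    print(value, similarity_level(value))
```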

Bedeian (2014) emphasized the importance of descriptive statistics in affording a

researcher a basic understanding of the primary data collected in a quantitative study.

The first set of research questions channeled the research toward a statistical

description of the corpus of sampled dissertations and published articles employing

distributions, groupings, categories, and measures of central tendency. See Table 3

for a list of variables. Moreover, the SSI-substantive types referred to in the research

questions are the SSI equal to or larger than 5%. These SSIs were descriptively

identified as the potential of plagiarism-of-other or plagiarism-of-self.

Table 2

Comparison of Corpus Plagiarism Study Evaluation Levels

Authors                                   Year   Variable                             L1                 L2                   L3                   L4
Turnitin                                  2017   DSI                                  Green 0-24%        Yellow 25%-49%       Orange 50%-74%       Red 75%-100%
Mayesᵃ                                    2017   aDSI                                 Low 0-14%          High 15-24%          Excessive 25%-100%
Thomas and de Bruin                       2014   aDSI                                 Low 0-9%           Moderate 10%-14%     High 15%-24%         Excessive 25%-100%
Zhang & Jia Survey Resultsᵇ               2012   OSI (DSI)                            Minor 8.99%        Moderate 21.69%      Serious 38.78%       Rejection 50.49%
Masic                                     2012   Non Original                         Acceptable 0-24%   Rejected 25%-100%
Walker                                    2010   DSI                                  Moderate 0-19%     Extensive 20%-100%
Batane                                    2010   aDSI                                 Legitimate 0%      Low 1%-34%           Medium 35%-69%       High 70%-100%
Bretag and Mahmud                         2009   DSI                                  0-10%              11%-24%
Higher Education Commission, Pakistanᵃ    n.d.   University Guidelines for Turnitin   Acceptable 0-18%   Rejected 19%-100%

ᵃ Set SSI Revision Rate at 5%. ᵇ Used the DSI evaluation rates for evaluating the SSI rates.

RQ1.1: What are the descriptive statistics of Turnitin’s reported document similarity

indices (DSI), including percentages and synthesized word counts; researcher-adjusted

document similarity indices (aDSI), including percentages and synthesized word counts;

source similarity indices (SSI-substantive type), including percentages and synthesized

word counts; and document metadata for the corpus of sampled dissertations?

Table 3

Variables Used for the Corpus Analysis in RQ1, RQ2 & RQ3

Variable Name                                      RQ         Scale           Values
Document Similarity Level (DSL) [DV]               3          Polychotomous   Low, High or Excessive
Year of Publication (YOP) [IV]                     1, 3       Continuous      2011-2015
Research Method (DRM) [IV]                         1, 3       Categorical     Quant, Qualt, Other
Author Count (ACT) [IV]                            1.2, 3.2   Continuous      Greater than 0
Word Count (WCT) [IV]                              1, 3       Continuous      Greater than 0
Reference Count (RCT) [IV]                         1, 3       Continuous      Greater than 0
Document Similarity Index (DSI) [DV]               1, 2       Continuous      0-100%
Document Similarity Word Count (DSW)               1          Continuous      Greater than 0
Adjusted Document Similarity Index (aDSI) [DV]     1, 2, 3    Continuous      0-100%
Adjusted Document Similarity Word Count (aDSW)     1          Continuous      Greater than 0
Substantive Other SSIᵃ                             1          Continuous      Frequency
Substantive Self SSIᵃ                              1          Continuous      Frequency
Substantive Other SSI (Mean Similarity)ᵃ                      Continuous      5-100%
Substantive Self SSI (Mean Similarity)ᵃ            1          Continuous      5-100%
Substantive Other SSI (Mean Word Count)ᵃ           1          Continuous      Greater than 1
Substantive Self SSI (Mean Word Count)ᵃ            1          Continuous      Greater than 1

ᵃ Substantive SSI are equal to or greater than 5%.

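RQ1.2: What are the descriptive statistics of Turnitin’s reported document similarity

indices (DSI), including percentages and synthesized word counts; researcher-adjusted

document similarity indices (aDSI), including percentages and synthesized word counts;

source similarity indices (SSI-substantive type), including percentages and synthesized

word counts; and document metadata for the corpus of sampled published articles?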

Batane (2010) found that Turnitin suffered from a “tendency of the software to

identify the material as plagiarized” (p. 3). He suggested that Turnitin users verify all

instances of identified content similarities and make the necessary adjustments. This

study provided an opportunity to identify the statistical and practical significance of the

required adjustments or corrections. In simpler terms, RQ2.1 and RQ2.2 ask the

question “is it necessary for a plagiarism researcher to verify the results of a Turnitin

analysis report?”

RQ2.1: Are there statistically and practically significant differences between the levels of

Turnitin’s reported document similarity indices (DSI) and my adjusted document

similarity indices (aDSI) for the corpus of sampled dissertations?

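RQ2.2: Are there statistically and practically significant differences between the levels of

Turnitin’s reported document similarity indices (DSI) and my adjusted document

similarity indices (aDSI) for the corpus of sampled published articles?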
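
The abstract names a Wilcoxon signed-rank statistic as the test for the paired DSI and aDSI values that RQ2.1 and RQ2.2 compare. The sketch below shows how that statistic is computed in general. It is not the SPSS or R syntax listed in Appendix C: the function name `wilcoxon_signed_rank`, the sample values, and the plain normal approximation (no tie or continuity correction) are assumptions made for illustration.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Return (w_plus, w_minus, z) for paired samples.

    Zero differences are dropped, tied absolute differences share their
    average rank, and z uses the normal approximation without tie or
    continuity corrections (a simplification).
    """
    diffs = [b - a for b, a in zip(before, after) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, w_minus, z


# Hypothetical whole-percent indices for five documents (not study data).
dsi = [12, 20, 9, 30, 15]    # Turnitin-reported DSI
adsi = [8, 14, 9, 22, 14]    # researcher-adjusted aDSI
w_plus, w_minus, z = wilcoxon_signed_rank(dsi, adsi)
print(w_plus, w_minus, round(z, 4))
```

A sample this small would normally call for exact p-values instead of the normal approximation; the example only illustrates the ranking step.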

Often corpus-based plagiarism studies analyze documents for evidence of

plagiarism and include predictive analytics based upon various researcher-available

predictor variables against the measured plagiarism values (e.g., Honig & Bedi, 2012;

Ison, 2012; Perfect, Defeldre, Elliman & Dehon, 2011; Sun, 2013; Thomas & de Bruin,

2014). However, there are few actionable research outcomes based on the findings.

The profession should conduct additional research on an ongoing basis. This study

includes a predictive research component.

RQ3.1: Does document research method, year of publication, word count, and

reference count predict membership in low, high or excessive levels of the plagiarism

categories for the corpus of sampled dissertations?

RQ3.2: Does document research method, year of publication, author count, word count,

and reference count predict membership in low, high, or excessive levels of the

plagiarism categories for the corpus of sampled published articles?
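
For readers less familiar with multinomial logistic regression, the model behind RQ3.1 and RQ3.2 can be written in its generic textbook form, together with the Nagelkerke pseudo R² that the abstract reports. This sketch assumes that the Low level serves as the reference category. The symbols (β for coefficients, x_k for the K predictors, L_0 and L_M for the likelihoods of the intercept-only and fitted models, n for the sample size) are illustrative and do not appear in the source; this is not the fitted model reported in the Results.

```latex
% Generic multinomial logit with "Low" as the assumed reference category
\ln\!\left(\frac{P(\mathrm{DSL}=j)}{P(\mathrm{DSL}=\mathrm{Low})}\right)
  = \beta_{j0} + \sum_{k=1}^{K} \beta_{jk}\, x_{k},
  \qquad j \in \{\mathrm{High},\ \mathrm{Excessive}\}

% Nagelkerke pseudo R^2 (Cox and Snell R^2 rescaled to a maximum of 1)
R^{2}_{N} = \frac{1 - \left(L_{0}/L_{M}\right)^{2/n}}{1 - L_{0}^{\,2/n}}
```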

Delimitations

Delimitations set a study’s boundaries (Simon, 2011). The first delimitation was

the selection of two widely commercially accepted document databases (EBSCO and

ProQuest Dissertation and Theses). My reliance on EBSCO and ProQuest was critical

to my study. However, from the researcher’s standpoint, Turnitin’s internal functionality

and accuracy were undocumented. Moreover, Turnitin’s document acquisition process

and the schedule of document availability for researchers were unknown. For example,

ProQuest dissertations, by author choice, can remain unavailable to queries for up to

five years. I also found that most EBSCO query results would report a larger

number of documents than were listed and retrievable. For example, a user query might

retrieve ten pages and 120 documents. Upon review, as a user reached the eighth page,

there were no more documents to review and far fewer than 120 documents. Given these

potential anomalies, during the engineering phase of this study, I secured the document

population on which I would later build my sample. I limited the corpora to the publishing

years 2011 through 2015 as described in the methodology section, securing a

document population file through the end of the 2015 calendar year.

I selected Turnitin software for the content originality analysis because Turnitin is

one of the leading COA software applications. However, Turnitin only identifies content

similarities and cannot distinguish between the different categories and types of

plagiarism (see Figure 1, p. 16). When Turnitin could provide information that would

lead to a reasonable differentiation between substantive plagiarism-of-others and

plagiarism-of-self, I recorded that evidence.

Reverse plagiarism occurs when an author gives another author credit

when none is warranted (Jent, 1967; Knight, 2013; Moten, 2014). Turnitin had no

practical way of detecting reverse plagiarism. This study does not include reverse

plagiarism in its calculations and analysis.

I configured Turnitin’s settings to remove as much non-material content similarity

as possible. I excluded quoted content and references, word strings of fewer than ten

words, and student papers from the analysis (cf. Thomas & de Bruin, 2015). I manually

removed major publisher template similarities, such as repeated copyright notifications, from

the documents before submission to Turnitin. These pieces of text frequently appear

in Turnitin reports as significant numbers of SSI that erroneously affect the DSI.

Document format and size were areas of concern. Turnitin could not produce a COA

report for a PDF in image format. I converted documents in image format to

text format using Adobe Acrobat’s OCR capabilities before submission. However, one

document contained many images of the author's handwritten notes as

content. Turnitin did not include these images in the COA report.

Turnitin had a submission limitation of 400 pages. I submitted one document

larger than 400 pages in parts. I removed document pages not requiring an originality

check from the largest documents to reduce pages below 400. These document

preparation and analysis procedures were consistent, despite variances in document

type; author, word, and reference counts; and research method employed.
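
Submitting an oversized document in parts, as described above, amounts to cutting its page range into consecutive blocks no longer than the limit. The sketch below is one way to plan those blocks and is not the procedure used in the study. The function name `split_page_ranges` and the 950-page example are hypothetical, and treating exactly 400 pages as an acceptable part size is an assumption.

```python
def split_page_ranges(total_pages: int, limit: int = 400) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first, last) parts.

    Each part holds at most `limit` pages; the default of 400 reflects the
    submission limit mentioned in the text.
    """
    if total_pages < 1 or limit < 1:
        raise ValueError("total_pages and limit must be positive")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


# Hypothetical 950-page document split into parts for separate submission.
parts = split_page_ranges(950)
print(parts)
```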

The research revealed there was much concern about plagiarism by English-as-

a-second-language (ESL) authors (Sun, 2013). This phenomenon was beyond the

scope of this study because obtaining ESL demographic data would be difficult.

Moreover, this study did not identify plagiarism in language translations. Whether the

case is plagiarism-of-others or plagiarism-of-self, articles translated from one

language to another are difficult to process with Turnitin. Although the technology is in

its infancy, researchers continue to develop improved language translators that

may lead to better identification of cross-language content similarities (PT, 2011).

Limitations

Limitations are the potential shortcomings or weaknesses in the design of a study

(Simon, 2011). Recognizing the first limitation (page 11), I collected a random sample of

documents, predominately selected from the human resource development (HRD) field

using keyword-filtering technologies. These document searches returned various cross-

disciplinary documents that were included in this study. However, I do not generalize

this study's results beyond the HRD field.

Another limitation is that I conducted this study over a short period, and my

readers should not generalize the results as possessing any longevity beyond the

current strategies and technologies. Nor could I control the population counts within

each subpopulation. I observed that the ProQuest and EBSCO databases tended to return

smaller subpopulations in the later periods within a queried range. As a follow-up

examination, I queried both databases for the years 2012-2016 (an increase of one

year) in January 2017 and found that 2016 had less than half the document count

of each of the other individual years (2012-2015). One might conclude

that document database publishing regularly suffers collection, verification, and

processing delays.

The reliance on the features and accuracy of Turnitin, including the diversity of its

document collection database, was another limitation. Turnitin only identified text

similarities between the submitted document and the available documents located in its

document collection database (Mulcahy & Goodacre, 2004). Turnitin cannot include all

articles and textbooks in the Turnitin document collection database. Therefore, the

Turnitin report process can lead to Type 1 Errors, or the failure to identify potential

content similarities (cf. Stevens, 2009). Moreover, while Turnitin can remove quoted

materials from the COA, Turnitin does not always do so, thus leading to Type 2 Errors,

or the false identification of content similarities (cf. Stevens, 2009). To clarify, for the remainder

of this study, a false positive (Type 2 Error) refers to content similarities that are not

evidence of plagiarism. A common example of a false positive is a publisher inserting a

notice in the document; Turnitin may identify the content of the notice as plagiarism

across many documents with the same notice.

Turnitin’s parent company, iParadigms, engineered Turnitin to analyze

unpublished or recently published manuscripts. The COA process becomes complex for

previously published documents with copies and pieces of those documents spread

throughout the Internet. I found that Turnitin identified three types of source similarity

indices (SSIs): Publication, Internet source, and Student paper. The publication SSIs

were easy to investigate if Turnitin provided a direct link to the source document. If the

document collection database was the only source, the document was difficult to read and could not be

searched, selected, or copied. The Internet source SSIs also provided documents to

validate similarities. However, text similarities in these documents were often trivial and

template based. Text matches retrieved from web pages, including but not limited to document

abstracts and keywords, were numerous. The job of validating similarities was often

difficult because of a lack of listed author names and publication dates. The SSIs

identified as Student papers were the most difficult to validate. When students or

instructors submitted these documents to Turnitin, the submissions were placed in

a separate document collection database (DCD) apart from the published article DCD.

However, Turnitin would not make the documents available for inspection without

permission from the person who submitted them to Turnitin. This study followed

Thomas and de Bruin (2015) and removed the student document collection database

from consideration in the content originality report as these documents were not

published or copyrighted.

Another limitation is Turnitin’s sole reliance on percentages. Should a 100,000-

word document with 6% similarities (6,000 words) be compared with a 10,000-word

document with 20% similarities (2,000 words)? The question must be asked: “Which

document has the more serious plagiarism problem?” As Ison (2012) noted:

The size of dissertation [document] must be considered … when examining

similarity indices, as even a 2% overlap of a 200-page dissertation [document]

essentially means there are four pages worth of unoriginal material. (p. 234)
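
The arithmetic behind this limitation can be made explicit. The sketch below is illustrative only; the helper name `similar_amount` is an assumption introduced here and is not part of Turnitin. It converts a similarity percentage into an absolute amount of matching material for the examples discussed above.

```python
def similar_amount(total: int, similarity_pct: float) -> float:
    """Convert a similarity percentage into an absolute count (words or pages)."""
    return total * similarity_pct / 100


# The two documents compared in the text.
long_doc_words = similar_amount(100_000, 6)   # 6% of a 100,000-word document
short_doc_words = similar_amount(10_000, 20)  # 20% of a 10,000-word document

# Ison's (2012) example: a 2% overlap in a 200-page dissertation.
ison_pages = similar_amount(200, 2)

# The document with the lower percentage holds more similar words in total.
lower_pct_has_more_words = long_doc_words > short_doc_words
```

Under these figures, the document with the lower percentage contains three times as many similar words, which is why a percentage alone cannot rank the two documents by severity.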

LITERATURE REVIEW

Plagiarism Definition, Categories, and Types

The Committee on Publication Ethics (COPE) was formed in 1997 to address a lack

of formal guidance with which to handle unethical research and publishing conduct

(COPE, 2014). A group of biomedical journal editors associated with COPE initiated a set of

code-of-conduct guidelines. Using the COPE guidelines, Hulten, Nicholls, Winslet, and Kmiot

(2000) indicated that “plagiarism ranges from the unreferenced use of others' published

and unpublished ideas to submission under the new authorship of a complete paper,

possibly in a different language” (p. 247). However, a review of the literature suggests that

plagiarism is more complex and more difficult to apply conceptually than

Hulten et al. (2000) implied. The literature identifies three main

categories of plagiarism: plagiarism [plagiarism-of-others] (O'Connor, 2010), self-

plagiarism [plagiarism-of-self] (Cheung & Driver, 2004; Yentis, 2010), and reverse

plagiarism (Jent, 1967; Knight, 2013; Moten, 2014). For the remainder of this dissertation,

and for the sake of clarity, the term “plagiarism” encompasses all

three categories of plagiarism, as shown in Figure 1. I used Turnitin’s text-similarity

detection system to identify the plagiarism categories shown in green in Figure 1.

While Turnitin can detect content similarities leading to potential evidence of

plagiarism-of-others and plagiarism-of-self, it is unable to differentiate between the two.

Moreover, Turnitin cannot identify the types within the three categories. Any

identification beyond text similarities is left to an investigator and often rests on subjective

decisions based upon a preponderance of the evidence.

Plagiarism-of-Others

Handa (2008) defined plagiarism-of-others as:

The failure to acknowledge other colleagues’ scientific work - their ideas,

language, or data. It [plagiarism-of-others] may include verbatim copying of

passages without citing the original contributor, rewording of ideas, paraphrasing,

and even total reproduction by simply changing the authors’ names and trying to

pass the material as one’s own. (p. 301)

Figure 1. Plagiarism category and type map. Using Turnitin this study identified potential overall plagiarism (blue box) by examining content similarities that included, but not limited to categories of substantive plagiarism-of-others and substantive plagiarism-of-self (green boxes). However, this study does not identify or summarize plagiarism types within each category (yellow boxes). Moreover, Turnitin cannot directly identify reverse plagiarism, data redundancy and layered citations (red boxes).

According to O’Connor (2010), plagiarism-of-others is a common publication

problem. There are various types of plagiarism-of-others (i.e., document, paragraph,

paraphrasing, and layered citations). Document plagiarism-of-others occurs when a

substantial portion, if not all, of a document is presented as one’s own work. For example,

McMurtry (2001) found that students used web-based services (paper-mills) to buy

documents to be passed off as their own. She reported instances of students who had

paid as much as $35.00 a page for documents from web-based paper mills. However,

as Posner (2007) indicated, that kind of transgression would be better defined as

academic fraud, not plagiarism. The student's intent is not to wrong the original author, but

to deceive the reader (faculty). However, there remains a question about who owns the

copyright should the students decide to publish their work.

Paragraph plagiarism-of-others is when one copies paragraphs or sentences

directly into one’s own work without proper citations. Chao, Wilhelm, and Neureuther

(2009) posited that powerful document databases such as Google Docs, EBSCO,

LexisNexis, and ProQuest provide researchers seemingly unlimited resources of

electronic text. The digital age has affected paragraph plagiarism-of-others with simple

“copy and paste” functionality (Chao et al., 2009; O'Connor, 2010). Paraphrasing is an

attempt to correct or hide paragraph plagiarism-of-others (Chao et al., 2009).

Inexperienced authors may copy and paste another’s work into their documents and

then change several of the words believing that this infraction is technically not

plagiarism-of-others. At a minimum, an author must provide a citation and reference as

the original idea remains another’s work. If the majority of words from the original

source remain, quotations are additionally required. If an author inadequately

paraphrases copied content, CSA software will still identify the similarities (Chao et al.,

2009). Whether to count poorly executed paraphrasing with citations, but without

quotation marks, as plagiarism is a subjective call made by the investigator.

Layered citations or citation overlaps can also cause plagiarism-of-others issues

even for the most experienced author. Tucci and Galwankar (2011) reported that a

quote is often credited to the publication from which the text was retrieved, but an

earlier document may have been the original source. When citing one’s source, a

researcher must execute due diligence in identifying prior publications where that

content might have first originated (Tucci & Galwankar, 2011). Researchers often use

the term "snowballing" to describe how references often lead to other related references

(Regmi & Naidoo, 2013, p. 33). This process often reveals prior and sometimes original

sources. However, the APA (2010) stated that if one does not have access to older

works, one must still include the reference but may cite the source using the format “as

cited in” (p. 178).

Plagiarism-of-Self

One of the most misunderstood and least researched types of plagiarism is self-

plagiarism, which this study refers to as plagiarism-of-self. The APA (2010) identified

plagiarism-of-self as the “practice of presenting one’s own previously published works

as though it [previously published works] were new” (p. 170). Duplicate publication

(including translations), text recycling, and data redundancy are three types of

plagiarism-of-self (cf. Adhikari, 2010; Cheung & Driver, 2004; Roig, 2010; Yentis, 2010).

Duplicate publication often occurs when authors submit the same manuscript to

multiple journals (Cheung & Driver, 2004; Yentis, 2010). Susser and Yankauer (1993)

reported that while the inclusion or order of authors may have been changed, and the

article may have had some minor changes in the wording, duplicate publications are the

same work, just repackaged.

Language translations of prior works are also considered plagiarism-of-self

(Cheung & Driver, 2004). However, Yank and Barnes (2003) found that in practice,

about 30% of surveyed editors and authors felt that redundant publishing in a non-

English journal was acceptable. Sibbald (2000) added that duplicate publishing is often

critical for reaching diverse audiences. However, Sibbald cautioned that the copyright

holder must provide a release and the secondary publisher must be aware of the

previous publication (2000).

An often-misunderstood instance of duplicate publishing is an article that was

published in conference proceedings and then later submitted for publication in

a journal. If conference proceedings are publicly accessible or the authors relinquish their

copyright protection, the article cannot be published again without documented

permissions and notifications, according to Sibbald (2000). Callahan (2012) warned

authors that they should:

Be vigilant about ensuring that your intellectual property is not openly accessible

online if you want to continue publishing on the topic. This is especially important

for working on conference papers you hope to publish as chapters or journal

articles. Another option is to ensure that, if you want to publish your work later,

you do not submit full papers to conferences, but instead submit only abstracts.

(p. 8)

The APA (2010) uses the phrase “limited circulation” as a keyword in

understanding duplicate publishing issues (p. 13). Governmental agency studies,

university department reports, and U.S. dissertations are not normally marketed or

available to the public on a broad scale and thus can be republished in modified or

extended form. However, conference proceedings and book chapters that are offered to

the public are not considered limited circulation and cannot be republished in whole

or part. Regarding APA brief reports, the APA stated that if they “include sufficient

descriptions of methodologies to allow for replications; the brief report is the archival

record for the work” (p. 13). The brief report, or a variation of it, cannot be republished without

publisher permissions and notification to the readers. Furthermore, the APA stated that if a

“brief report is published in an APA journal it is with the understanding that the extended

report will not be published elsewhere.” That may imply that this type of duplicate

publication is acceptable if it is published within the same publishing house. The APA

does not say that other publishers may allow duplicate publication within their publishing

house.

According to Adhikari (2010) and Roig (2010), another form of plagiarism-of-self

occurs when authors use parts of their previous works in current works without proper

citations. They both referred to this type of plagiarism-of-self as “text recycling”

(Adhikari, 2010, p. 77; Roig, 2010, p. 299). Often researchers start repeating

themselves, especially after publishing multiple manuscripts, using paragraphs and

even whole sections from their previous works (Adhikari, 2010; Roig, 2010). While

authors must be diligent in their disclosure of previously published ideas, APA (2010)

specified that if self-citation is awkward, an author could limit the depth of the citations.

A simple phrase (e.g., “as I have previously discussed”) often suffices (APA, 2010, p.

16). Moreover, APA (2010) deemed text recycling to describe analytical approaches as

acceptable. Robinson (2014) argued that while plagiarism-of-self issues in the

biomedical field can be a serious matter, in other academic fields “institutions . . . should

not be rushing to incorporate strictures against all forms of textual recycling into their

academic integrity policies” (p. 275).

Data redundancy is another type of plagiarism-of-self (O’Connor, 2010; Yentis, 2010).

Analyzing subsets of data collected at the same time and reporting the results across

multiple studies and manuscripts may be considered a form of plagiarism-of-self. Terms

like “data salami slicing,” “data subdivision,” or “fragmented publication” are used to

describe this type of unethical research (e.g., Adhikari, 2010; Farthing, 2006; Karlsson &

Beaufils, 2013). Even if a study is analyzing different variables, using the same

database may be considered data redundancy. If there are any potential ethical

considerations, the author must provide an author’s note to the editor and readers.

According to APA (2010):

Data that can be meaningfully combined within a single publication should be

presented together to enhance effective communication . . . Authors must inform

the editor of any similar manuscript . . . Authors have a responsibility to reveal to

the reader that portions of the new work were previously published. (pp. 14-15)

In itself, data redundancy may seem impossible to detect. However, Spielmans,

Biehn, and Sawrey (2010) found data redundancy plagiarism issues by identifying text

similarities across content using searches by topic, authors, and institution names.

There is much confusion and disagreement about how serious a publication

problem plagiarism-of-self is. Established authors are often pitted against less

experienced authors trying to establish a publishing reputation. Callahan (2014) advised

HRD colleagues to:

Resist the urge to use the label self-plagiarism. Such a label facilitates a moral

panic that is unjustified and unjust . . . There is much to question about the extent

to which self-plagiarism is a real issue or a manufactured issue to serve the

interests of a selected few. (pp. 7-8)

In a similar vein, Robinson (2014) concluded that while plagiarism-of-self

“misrepresents the nature and size of the author’s accomplishment” and potentially

fosters resentment among one’s colleagues, plagiarism-of-self “should not be treated as

an academic integrity issue” (p. 270).

However, Schminke and Ambrose (2014) authored an editorial explanation of the

retraction of one of their previous editorials. During activities of a doctoral class, the

students noticed an editorial by Schminke and Ambrose had considerable similarities

(26%) with another paper written and published by Schminke. Their instructor notified

the editors of the Academy of Management Review of the evidence of plagiarism-of-

self. Given the lack of a predefined course of action, the editors decided on a remedy

that included an article retraction and the authors editorializing on what happened and

why. An interesting aside was that the editors asked the students if this remedy was

sufficient. The students replied they were satisfied. The lesson learned from this case is

that any interested person, including students, can bring up even a renowned professor

or author on charges of plagiarism. Any professor or author could potentially find

themselves in a situation with career-damaging consequences (cf. David, 2011).

While students do not normally publish their assignments, Halupa (2014)

identified student plagiarism-of-self as the result of students recycling their work across

multiple courses. There is some debate whether text recycling of unpublished content is

plagiarism of any type. It is common practice for graduate students to build upon their

previous work; recycling has not been a serious issue. Moreover, citations in theses or

dissertations are not required for original work from a student’s previous unpublished

assignments. However, depending on an institution's academic policy, this type of

infraction could be an attempt to deceive the institution and faculty.

Reverse Plagiarism

Reverse or inverse plagiarism is the attribution of one’s own ideas to another

author (Jent, 1967; Moten, 2014). While the motives may seem beyond comprehension,

students have used reverse plagiarism to increase their reference counts (Greenbird,

2009). Authors have also attempted to add additional credibility to their works by

associating an idea with a more renowned author (Turmfalke, 2010). Shanmugam

(2009) studied trainee teachers’ assignment work looking for citation errors and found

that 16.01% of the citation errors were incorrect author assignments. While

Shanmugam did not identify these errors as reverse plagiarism, the evidence does

potentially support reverse plagiarism (2009).

A variation of reverse plagiarism is “editing as reverse plagiarism” (Knight, 2013,

n. p.). Knight reported he found that the editor had significantly changed a manuscript

which he had authored and submitted for publication. Furthermore, the publisher had

published the edited manuscript without his review. Knight felt uncomfortable in having

the edited version attributed to him and complained. The editor only offered an apology

for potentially hurting his feelings. Reverse plagiarism is extremely difficult to detect

even at the reviewing stage. Often, instances of reverse plagiarism are called to

attention only by the authors who had been erroneously cited.

Factors Contributing to Plagiarism

A review of the relevant literature (e.g., Fang, Steen, & Casadevall, 2012; Horner

& Minifie, 2011; Onwuegbuzie & Daniel, 2005) suggested that misconceptions, mistakes

or errors, fraud, and cultural differences are the primary factors contributing to

plagiarism. Samuelson (1994) added that “the sin of laziness” is at the heart of

plagiarism (p. 24). He stated that if authors just rewrote their prose, they could avoid most

instances of plagiarism (Samuelson, 1994).

Misconceptions

The most common reason for plagiarism violations appears to be author

ignorance (Horner & Minifie, 2011). College students and sometimes their faculty are

not knowledgeable about what constitutes plagiarism. Often they have received neither

formal education in, nor extensive experience with applying, publishing ethics (e.g.,

Cheema, Mahmood, Mahmood & Shah, 2011; Marcus & Beck, 2011; Orim, Davies,

Borg & Glendinning, 2013). Moreover, variations in industry citation and reference

standards, often documented in multiple publishing guides, exacerbate the problem

(Auer & Krupar, 2001). Auer and Krupar (2001) found students became confused when

exposed to both APA and MLA citation formats. Considering that the EBSCO

publication database commonly provides seven different citation formats, one can

understand the difficulty in applying citation and reference standards. Furthermore, one

can conclude that few students understand copyright laws, fair use doctrine, and

potential litigation entanglements. Chao et al. (2009) demonstrated that students who

had received instruction in plagiarism avoidance were significantly less likely to

plagiarize.

Mistakes, Errors or Common Exceptions

Onwuegbuzie and Daniel (2005) indicated that the complexity of bringing a

manuscript to a publishable level could create instances of citation and reference errors.

Sometimes an author just forgets that they were not the source of an idea and fails to

cite the contribution. The term cryptomnesia as defined by Hege (2008) illustrates this

common plagiaristic mistake:

[Cryptomnesia] inadvertent plagiarism represents a memory failure in which

individuals either misattribute the source of the information to themselves rather

than to the true originator or they simply do not recall having encountered it

before and claim that it is their own novel creation. (p. ii)

However, the APA (2010) stated that to credit the cited source, citations and

references must be complete and correct. While these kinds of errors may not seem

serious, the publishing profession discourages citations that cannot direct the reader to

the cited work (Onwuegbuzie & Daniel, 2005). In general, the reader must have the

opportunity to check the accuracy of the source the author used.

There are areas where plagiarism is accepted, or at least remains unchallenged.

Textbook writing is a publication area where the reader has accepted that few ideas in

a textbook are original and that if every sentence were cited, the book would be

unreadable (Posner, 2007). Posner posited that most readers of textbooks understand that

the content is a compilation of the topic as accumulated over time and for the most part,

are uninterested in knowing the originality of the various concepts.

Posner (2007) advocated that concealment (fraud) is an essential characteristic

of plagiarism. Searching for evidence that fraud exists, Fang et al. (2012) studied 2,047

retracted biomedical and life-science research articles. They posited that publishers

retracted 43% of the articles because of fraud or suspected fraud. While rare, an author

may commit fraud and intentionally attempt to pass off another’s product as their work

(Fang et al., 2012). In areas where potential plagiarism occurs, issues of ownership and

copyrights are often subject to litigation in civil court (Mawdsley, 2009; Posner, 2007).

Cultural Differences

Occasionally authors from one culture engage in what other cultures consider

plagiarism. There is evidence that not all cultures subscribe to the same ethical

standards regarding publication (Shi, 2006). Shi (2006) further explained that authors

whose second language is English might feel the need to engage in “textual

appropriation” or borrow words and paragraphs as they attempt to master the language

(p. 264). However, Liu (2005) rejected the idea that plagiarism is an acceptable practice

in his Chinese culture. Liu reported that his education in China never

endorsed the idea that plagiaristic activities were acceptable (2005).

Consequences of Plagiarism

Plagiarism is not a victimless crime and affects many stakeholders in the

research and publication profession. Authors, institutions of higher education,

corporations, editors, and publishers all have a stake in quality research and

publications. A plagiarist can appropriate profits by taking

copyrighted materials and then claiming the spoils as the plagiarist’s own work (Posner, 2007).

However, Posner stated:

Though there is no legal wrong named “plagiarism,” plagiarism can become the

basis of a lawsuit if it infringes upon a copyright or breaks the contract between

author and publisher. (p. 34)

According to literature (e.g., Hendee, 2007; Karlsson & Beaufils, 2013; Neville &

Wadler, 2005), there is a wide variety of consequences for those engaging in plagiaristic

activities. Authors should consider the potential ramifications, ranging from a simple

article rejection to a more serious article retraction before engaging in unethical

publishing tactics. A founded accusation of plagiarism can discredit the author and can

lead to serious career-damaging consequences such as dismissal (Neville & Wadler,

2005). Moreover, accusations of plagiarism can lead to entanglement in legal

proceedings (Kock, 1999; Posner, 2007; Rubio, 2013).

Article Retraction

Article retraction is the most visible consequence of plagiarism when

redundant publication is the cause (Karlsson & Beaufils, 2013). Amos (2014) studied

retractions in the biomedical literature. She reported that out of 754 retractions, about

one-third (253) resulted from plagiarism (130) and duplicate publication infractions

(123). She also captured the authors’ national affiliation and created a top 20 national

affiliation ranking by retraction counts. See Table 1 for a national affiliation ranking of

the top five of her 20 by plagiarism and duplicate publication infractions.

SAGE Publications retracted 60 articles from the Journal of Vibration and

Control (Retraction Watch, 2014). The retraction notice stated that SAGE found a peer

review ring created by a single person creating multiple aliases using different email and

SAGE user accounts. While the notice did not list plagiarism as a driving force of the

retractions, there were serious fraudulent authorship issues that led to duplicate

publication submissions.

Karlsson and Beaufils (2013) posited that publication retractions could critically

reflect on the reputation of publishers and their editors. Moreover, disseminating

retraction notices is a complex process as publishers must notify international

databases and provide justifications for retractions. Furthermore, a retraction can lead

to the death of a publishing career. An institution can place an author in censorship

status, and an affected publisher could ban the author(s) from future submissions

(Karlsson & Beaufils, 2013; Wittmaack, 2005).

Legal Proceedings

Hendee (2007) stated that a major consequence of any engagement in plagiaristic

activities is often entanglement in costly legal proceedings. A publisher who owns the

copyrighted material may engage in litigation against a plagiarizer, even when the

plagiarizer is the original author. Cheung and Driver (2004) warned, “redundant

publication or self-plagiarism [plagiarism-of-self] can constitute copyright infringement if

authors reuse text or elements of papers that they no longer own” (p. 6). However,

Cheung and Driver (2004) concluded that the U.S. legal system remains sympathetic to

those who have reused their work and often rules in their favor.

Probably the most serious consequence resulting from plagiaristic activities is the

chance of entanglement in criminal proceedings (iThenticate, n.d.). For example, if

federal research grants are involved, and the research leads to a misuse of funds,

criminal proceedings may be involved. The Inspector General Act of 1978, as amended,

gave the federal Office of Inspector General the authority to investigate and

recommend prosecution of cases where grant recipients have misused government

funds, including research grants tainted by research misconduct (U.S. Government,

2014). The Office of Inspector General reaffirmed these efforts in their Semiannual

Report to Congress (2014) which stated:

Research misconduct damages the scientific enterprise, is a potential misuse of

public funds, and undermines the trust of citizens in government-funded

research. It is imperative to the integrity of research funded with taxpayer dollars

that NSF-funded researchers carry out their projects with the highest ethical

standards. For these reasons, pursuing allegations of research misconduct

(plagiarism, data fabrication, and data falsification) by NSF-funded researchers

continues to be a focus of our investigative work. (p. 21)

Moreover, the entanglement of unethical research with the legal system extends

beyond the United States. Rubio (2013) and Cromwell (2012) reported on a Colombian

Supreme Court decision, which sentenced a professor in May of 2010 “to two years in

prison plus monetary and civil sanctions for plagiarizing a student’s thesis” (Rubio,

2013, p. 141).

Tainted Research

Another important, yet less visible, consequence is that all categories of plagiarism

often misrepresent the influential weight of research (Cheung & Driver, 2004). Cheung

and Driver (2004) explained that duplicate conclusions coming from plagiaristic activities

make findings more credible than they should be. On the other hand, data redundancies

(salami slicing) tend to dissect knowledge into pieces as opposed to fitting the

discoveries together (O’Connor, 2010; Yentis, 2010). While these piecemeal activities

generate more articles, these activities also circumvent any comprehensive

understanding of the explained phenomenon (O'Connor, 2010; Yentis, 2010).

Empirical Research

Academic literature, including dissertations, appears to be increasingly rich in

plagiarism research. Based on a keyword search of articles published in EBSCO from

the year 2011 to 2015, using the term “plagiarism,” EBSCO returned 2,240 published

scholarly, full text, peer-reviewed articles. An equivalent search of dissertations using

ProQuest Dissertations & Theses Global returned 5,982 documents. I limited my

literature review to published literature and excluded dissertations. I further reduced the

literature count by adding the search term “study” which then returned 176 documents.

From these documents, the relevant studies tended to focus on plagiarism from the:

• Student perspective (e.g., Chao, Wilhelm, & Neureuther, 2009; Cheema,

Mahmood, Mahmood & Shah, 2011; Hege, 2008; Siaputra, 2013),

• Faculty perspective (e.g., Allen, Ball, & Smith, 2011; Bennett, Behrendt &

Boothby, 2011; Halupa & Bolliger, 2013; Marcus & Beck, 2001; Olt, 2007),

• Perspective of editors and reviewers (e.g., Broome, Dougherty, Freda, Kearney,

& Baggs, 2010; Elbeck, 2009; Zhang & Jia, 2012).

Moreover, there were several studies that examined and reported levels of

plagiarism in published works (Sun, 2013; Thomas & de Bruin, 2014). There were few

studies that reviewed techniques for employing plagiarism detection software (Hill &

Page, 2009; Heather, 2010).

Student Perspectives

Ling (2006) interviewed 46 undergraduate students from five different language

backgrounds: Native-English-speaking (n = 11), German (n = 10), Chinese (n = 8),

Japanese (n = 9), and Korean (n = 8). He reported that:

Findings suggest that the majority of participants were not sure about whose

words and which ideas they needed to cite with acknowledgment in their writing.

Many participants who speak English as a second language (L2) also expressed

concerns about being accused of copying as innocent language learners and

some with nonwestern backgrounds also found the concept of plagiarism foreign

and unacceptable. (p. 264)

Cheema, Mahmood, Mahmood, and Shah (2011) surveyed 60 doctoral and

master’s students about their understanding of plagiarism. While they found that

students understood the basics of plagiarism, students lacked sufficient knowledge

about specifics and the possible repercussions from being involved with plagiaristic

activities. Similarly, Orim et al. (2013) interviewed 18 students and found that they had

an inadequate understanding of plagiarism and concluded that university instruction did

not include sufficient plagiarism awareness.

Siaputra (2013) studied personality traits of students who plagiarize and found a

statistically significant correlation (r = .27) between levels of plagiarism and

procrastination. Hege (2008) investigated how the mood of a student could affect their

ability to identify sources from memory. She found that students in a happy mood made

more memory source errors than those in a sad mood. Thus, she concluded that

students in a happy mood would inadvertently plagiarize more often than students in a

sad mood.

While academic literature has identified some of the systemic issues relating to

plagiarism, there have also been corpus plagiarism studies that have tested theories of

intervention (e.g., Dee & Jacob, 2012; DeGeeter et al., 2014; Hege, 2008; Youmans,

2011). Dee and Jacob (2012) provided 28 students with instructions on “what

constitutes plagiarism and providing them with effective strategies for avoidance” (p.

423). Their results exhibited a 3.6% lower plagiarism rate when compared to a control

group who received no instruction. Chao, Wilhelm, and Neureuther (2009) studied the

effects of providing a group of students with “clear and specific instructions on reducing

plagiarism in a graded writing assignment” (p. 39). They found that students who

received instruction exhibited a 3.16% lower content similarity index than students in the

control group who did not receive the instructions. DeGeeter et al. (2014) used pre- and

post-intervention assessments to identify the effect of educational instruction on

plagiarism. They found that, following the intervention, their sample of 252 students had a

4% increase in the number of students who identified plagiarism.

Youmans (2011) conducted a similar intervention study using two groups of

students (n = 90). The instructor informed the experimental group that he would use

Turnitin to check for plagiarism. The control group was uninformed. The DSI

measurements of the informed students (n = 44, M = 7.59%, SD = 7.17%) did not

differ significantly from those of the uninformed control group (n = 46, M = 7.29%,

SD = 7.10%). Youmans (2011) concluded that informing students that their work

would be checked with plagiarism detection software was inconsequential in

preventing plagiaristic student activities.
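The size of the gap between the two reported group means can be illustrated with a Welch t statistic computed from the summary statistics alone. This is a minimal sketch for illustration only: the choice of Welch's test and the function name `welch_t` are assumptions, and this is not a reproduction of the analysis Youmans (2011) performed.

```python
import math


def welch_t(n1, m1, sd1, n2, m2, sd2):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first group mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Informed group versus uninformed control group, as reported above (DSI in %)
t, df = welch_t(44, 7.59, 7.17, 46, 7.29, 7.10)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The resulting t statistic is roughly 0.2, far below the value of about 2 usually needed for significance at the .05 level with samples of this size, which is consistent with the reported finding of no significant difference.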

Faculty Perspectives

Halupa and Bolliger (2013) surveyed 340 faculty members (26.2% response rate)

and found that faculty perceived policies about plagiarism were not clear or understood

by faculty or students. Moreover, Bennett, Behrendt, and Boothby (2011) surveyed 159

instructors on what constitutes plagiarism and found that half of the respondents did not

see text recycling as a serious issue. Even fewer reported using software to check for

plagiarism in their work. However, concerning perceptions about faculty, Allen, Ball, and

Smith (2011) surveyed information systems researchers and faculty and found that 67%

had observed their colleagues engaged in plagiarism-of-self.

Marcus and Beck (2011) surveyed 99 speech and English faculty members for

their views on plagiarism. From 17 members who responded, they obtained 14 viable

surveys. They found that the respondents disagreed 50% of the time on what

constitutes plagiarism and were often not in accordance with their institution's policies.

Marcus and Beck proposed that faculty participate in additional training and professional

development. Olt (2007) conducted a qualitative study of responses from 28 faculties

across the U.S. on what would constitute course structure that remedies plagiaristic

activities by students. Resulting from her plagiarism prevention research, she proposed

a set of tasks for on-line course development and delivery, which included:

• Design prevention-focused syllabi

• Design plagiarism-resistant courses

• Design plagiarism-resistant assignments

• Ensure manageability

• Model ethical behavior

• Encourage interactivity

• Provide feedback

• Build strong relationships and trust

(Olt, 2007, p. 122)

Editor and Reviewer Perspectives

Broome, Dougherty, Freda, Kearney, and Baggs (2010) conducted a survey of

reviewers in the field of nursing. Over 1,600 respondents answered open-ended

questions involving publishing ethics. Sixteen percent of the reviewers indicated that

they were directly involved with detecting plagiaristic activities. Of the 16%, 98%

reported their findings to the editor, and 80% of those who reported were satisfied with

the outcome of their efforts.

Elbeck (2009) surveyed 26 journal editors (27% response rate) regarding

plagiarism-of-self. Elbeck found that 80% of the respondents would reject manuscripts

with high levels of plagiarism-of-self. Additionally, 15% of the respondents would forward

evidence of severe plagiarism-of-self to the author’s department chair and college dean.

However, Elbeck found that only 12% of the respondents reported the use of plagiarism

detection software in the pre-screening of submitted manuscripts (2009). More recently,

Zhang and Jia (2012) surveyed 3,912 journal editors (5.6% response rate) and found

that 42% of respondents had used a plagiarism detection tool. That is a large increase

in the use of plagiarism detection tools when compared to Elbeck’s (2009) study,

three years earlier.

Measuring Evidence of Plagiarism

Although there have been extensive descriptions of what constitutes plagiarism,

there are no definitive standards on which to base a detection of plagiarism.

While most schools publicize a zero tolerance for plagiarism, professional organizations

like COPE (2014) have not been able to provide standards or definitive measurements

for evaluating plagiarism, only suggestions for handling complainant-initiated plagiarism

cases.

Turnitin employs a color-coding scheme based upon the DSI (UMUC, 2016).

The color blue indicates fewer than 20 words are similar, green equates to 0% to 24%

similarity, while yellow is 25% to 49%, orange is 50% to 74%, and red is 75% to 100%

(see Table 2). Turnitin does not say whether the colors are for the DSI before any

investigation or exclusions or after a plagiarism investigator has identified and removed

all exclusions.
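The color bands described above can be expressed as a simple lookup. The following is a minimal sketch, assuming whole-number DSI percentages and assuming that the blue band is decided by a matched-word count supplied separately; the function name `turnitin_color` and its parameters are hypothetical and are not part of any Turnitin interface.

```python
def turnitin_color(dsi_percent, matched_words=None):
    """Map a document similarity index (0-100) to the color bands listed above.

    matched_words is a hypothetical optional input; when it is given and is
    below 20, the document falls in the blue band regardless of percentage.
    """
    if not 0 <= dsi_percent <= 100:
        raise ValueError("DSI must be between 0 and 100")
    if matched_words is not None and matched_words < 20:
        return "blue"
    if dsi_percent < 25:
        return "green"   # 0% to 24%
    if dsi_percent < 50:
        return "yellow"  # 25% to 49%
    if dsi_percent < 75:
        return "orange"  # 50% to 74%
    return "red"         # 75% to 100%


for dsi in (9, 25, 74, 75):
    print(dsi, turnitin_color(dsi))
```

Because the scheme does not state whether it applies before or after exclusions, the same function could be applied to either value.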

Occasionally, researchers have provided benchmarks based upon DSI results

that they have used in determining the extent of alleged plagiarism. For instance,

Thomas and de Bruin (2014) suggested using DSI values of “1% to 9% as low; 10% to

14% as moderate, 15% to 24% as high and equal to or greater than 25% as excessive”

measures of plagiarism (p. 2). The Higher Education Commission, Pakistan (n.d.)

published these guidelines on plagiarism levels:

If the report has [a document] similarity index <=19%, then the benefit of

the doubt may be given to the author but, in case, any single source has

similarity index >=5% without citation then it [document] needs to be

revised. (p. 3)
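The two benchmarks quoted above can be written as small decision rules. This is a minimal sketch under stated assumptions: the function names are hypothetical, the label for values below 1% is an assumption because the quoted bands start at 1%, and the second function covers only the cases the quoted guideline addresses.

```python
def plagiarism_level(dsi):
    """Classify a DSI (in %) using the bands quoted from Thomas and de Bruin."""
    if dsi >= 25:
        return "excessive"
    if dsi >= 15:
        return "high"
    if dsi >= 10:
        return "moderate"
    if dsi >= 1:
        return "low"
    return "below scale"  # assumption: the quoted bands do not cover values under 1%


def hec_guideline(dsi, uncited_source_indexes):
    """Apply the quoted Higher Education Commission rule.

    uncited_source_indexes holds the similarity index (in %) of each single
    source that lacks a citation.
    """
    if dsi > 19:
        return "outside quoted guideline"  # the quoted passage covers only <= 19%
    if any(index >= 5 for index in uncited_source_indexes):
        return "needs revision"
    return "benefit of the doubt"


print(plagiarism_level(17))         # high
print(hec_guideline(12, [2, 6]))    # needs revision
print(hec_guideline(12, [2, 3]))    # benefit of the doubt
```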

Zhang and Jia (2012) surveyed editors on what they perceived as troubling levels

of plagiarism and found that

The majority of respondents indicated that if between one-quarter [25%]

and one-third [33%] of the content in the abstract, introduction or

discussion is copied without citation, the paper is likely to be rejected. (pp.

296-297)

Masic (2012) posited that if 25% or more of an article is not original, a publisher

should take remedying action. Samuelson (1994) reported that her colleagues used a

30% acceptance rule for plagiarism-of-self. While these four examples provided

guidelines, the acceptable levels of plagiarism and appropriate corrective recourses are

subjectively left up to decision makers at individual institutions involved with the

publication process (cf. Toulouse Graduate School, 2016; MSU, n.d.).

Measuring Predictors of Plagiarism

Prediction is an important part of plagiarism research. In the field of statistics,

independent variables, also called predictor variables, are used to predict the outcome

in a dependent variable, sometimes called the outcome variable (cf. Field, 2011; Howell,

2010). A predictor should not be construed as a cause, only as a variable that predicts the

outcome. In preparation for this exploratory study, I examined several studies to review

what researchers have already investigated in the area of plagiarism prediction.

This study identified three reasons that researchers conduct predictive analytics

in empirical studies. The first reason is using prediction analytics in a “primer” document

(Petrucci, 2009, p. 193). These types of studies clarify and add to existing techniques

used in statistical prediction analyses and often include history, theoretical foundations,

example datasets, statistical software syntax, results, interpretations, and reporting

techniques. The second reason is the need to use prediction analytics to identify

potential historical or future results (cf. Fahy, 2013). The third reason is to use predictive

analytics to possibly encourage, reduce, or prevent the predicted results (cf. Youmans,

2011). The focus of this study was to promote the understanding of potential plagiarism

reduction or prevention strategies. Bertolucci (2013) calls this prescriptive analytics.

However, Siegel (2016) adds another layer to the predictive model. He describes the

actionable measure of any prediction analytics and posits that practical experience,

such as business experience, is very important to determine if any prescription is

actionable.

Directly applied to the study of plagiarism, Keck (2006) studied the quality of

paraphrasing with L1 (n = 79) and L2 (n = 74) students. Using “Near Copy, Minimal

Revision, Moderate Revision, and Substantial Revision” (p. 261) to evaluate each

student’s attempt at paraphrasing the main points in a 1,000-word essay, he found that

L1 writers produced better paraphrasing results and required less rewriting effort than

L2 students.

Gibelman and Gelman (2003) used a qualitative case study method and

examined what media reported about highly visible instances of academic plagiarism.

They reported that the news media tended to sensationalize accusations of plagiarism

by public figures that included prominent scholars. They concluded that student

perceptions of faculty and scholars who reap the rewards from plagiarism predicted

student plagiarism levels.

Martin, Rao, and Sloan (2011) studied 158 students using one assignment

submitted by each of the students to Turnitin. The focus of their study was to predict

plagiarism based upon participant demographic data. Using a 3% plagiarism threshold,

they found 74% of the students plagiarized. They also used ethnic markers

(Caucasian and Asian) to predict plagiarism and found no differences explained by

ethnicity. However, acculturation, or time spent exposed to a specific culture, did show

some linkage to plagiarism. They also concluded that just knowing one’s assignment

would be examined for plagiarism was not a strong deterrent. However, they supported

the premise that having students review their COA reports was an effective deterrent.

Perfect, Defeldre, Elliman and Dehon (2011) examined age and its effects on

plagiarism levels. Their study collected data on a balanced sample of young (n = 32, M

= 22.72 years, SD = 2.44) and older adults (n = 32, M = 65.22 years, SD = 4.46). While

the researchers anticipated that age predicted rates of plagiarism, they found no such

effect.

Honig and Bedi (2012) examined 279 papers by 636 authors who presented at

the International Management Division of the 2009 Academy of Management

conference. They found that 25.44% of the corpus (71 papers) they reviewed had some

level of plagiarism. Furthermore, they found that 13.6% of the corpus (38 papers) had

significant levels (5% or more of plagiarized content). Using their broad access to

participating conference authors and institutional demographics, Honig and Bedi (2012)

found that newly institutionalized (periphery) countries, untenured (versus tenured)

authors, authors with degrees from non-English-speaking countries, and male

gender predicted higher rates of plagiarism.

Ison (2012) used Turnitin to examine 100 dissertations (average 180.4 pages)

from predominantly online institutions from the years 2009 to 2011. He reported that the

dissertations had a mean DSI of 15.1%, with a standard deviation of 12.0%, and a

range of 2% to 81%. The theme of his study was that plagiarism was prevalent at online

institutions. However, he did not investigate that theme until 2014. Ison (2014) in a

follow-up study measured plagiarism statistics from 368 dissertations. The corpus

consisted of 184 dissertations from online institutions and 184 dissertations from brick

and mortar institutions for the period 2009 to 2013. He reported that there was no

statistically significant difference between the DSI of dissertations authored by students

at online institutions and the DSI of dissertations authored by students at traditional

institutions.

Youmans (2011) studied Turnitin results for students who were aware that the

instructor was submitting their work to Turnitin and those who were unaware of that

process. He concluded that just informing students that their work would be examined

with plagiarism detection software was inconsequential in preventing plagiaristic student

activities. However, Heckler, Rice, and Bryan (2013) studied Turnitin results for students

who were required and those not required to submit their assignments through Turnitin.

They concluded, “The fact that there were lower rates of plagiarism when students knew

they were being monitored suggests the detection system was an effective prevention

strategy” (p. 243).

Sun (2013) examined 600 STEM and Social Science articles using Turnitin. He

identified adjusted document similarity indexes that ranged from 0% to 48% using a 30-

word exclusion parameter. However, Sun went beyond simple descriptive analysis and

explored the data using regression techniques. Sun used journal categories, disciplines,

levels of author counts (1-2 & 3 or more), and official languages for predictors in a

negative binomial regression analysis. Sun found that journal categories and disciplines

did not predict rates of plagiarism. Sun found that single or dual-authored articles had

higher similarity scores than articles with three or more authors. Sun also discovered that

“whether or not an author was living in a context in which English is an official language

was not significantly associated with their Turnitin score” (p. 268).

Thomas and de Bruin (2014) examined 371 articles published in South African

journals for content similarity using Turnitin. They found document similarity indexes

ranging from 1% to 91% with a mean of 17.10% (SD = 12.15%). They reported that

almost 50% of the documents had serious evidence of plagiarism (10%-14%), with

27.2% of the documents evaluated being high (15% to 24%) and 21.3% being

excessive at equal to or greater than 25%. They also examined the author count and

found a statistically significant prediction (F = 9.6, df1 = 2, df2 = 115, p = 0.0001)

between DSI for author counts less than three and those three or more. Post-hoc

testing revealed that DSIs for one (M = 15.75, SD = 6.76) and two author teams (M =

15.42, SD = 7.08) were statistically significantly higher than larger groups of authors (M

= 10.65, SD = 4.28).

Research Method: While various studies have examined author personality traits,

journal affiliation, and authors’ native language to predict plagiarism outcomes, I found

no research examining the predictive effect that a study’s research method has on

plagiarism levels. Creswell (2014) believed that research methods chosen by

researchers were based upon what he refers to as a researcher’s “world view” or “a

general philosophical orientation about the world and the nature of research that a

researcher brings to the study” (p. 6). Gibelman and Gelman (2003) investigated how a

plagiarizer’s perception of academia influences plagiaristic activities. They cited the

Chronicle of Higher Education’s 2002 headline “Corruption plagues academe around

the world” (2002, p. 32). Gibelman and Gelman (2003) drew the conclusion that

Media revelations of plagiarism suggest that this occurrence has a perceived

payoff for those who risk it. Students rationalize the reasons for their misdeeds

as time-pressures, conflicting obligations, ‘everyone is doing it’ or, simply, the

information copied was there for the taking. (p. 245)

If Creswell’s “world view” is tainted by Gibelman and Gelman’s “perceived payoff” of

plagiarism, is there a possibility that a selected research method (or lack of one)

could predict plagiarism? Therefore, this study used the research method (quantitative,

qualitative, and others) as a predictor variable.

Year of publication: Applying White and Arzi’s (2005) definition of longitudinal

research design, “A longitudinal study is one in which two or more measures or

observations of a comparable form are made of the same individuals or entities over a

period of at least one year” (p. 138), I examined the Turnitin COA report data (entity)

across the reported years of publication. While I did not find any comparable

longitudinal research, this knowledge could provide an understanding of any potential

trends in COA technology, prevention strategies, or ethical commitments, all of which

could lead to further research.

Author counts: Previous research has investigated author counts. Sun (2013)

examined author counts and found no prediction for plagiarism levels. However,

Thomas and de Bruin (2015) found a statistical difference in plagiarism rates between

articles with one or two authors as opposed to articles with three or more authors. This

study could not investigate author counts as a predictor for the dissertations as the

author count is always one author.

Word counts: Weisgrau (2011) teaches her students the importance of brevity

(concise and short) for effective writing. Moreover, Humphreys and Klein (2006) found

that word counts can predict certain outcomes. They found that in a study of social

media for online support groups, the word counts were strong predictors of hierarchical

conversations (depth). I examined word counts as a predictor of plagiarism within each

corpus.

Reference counts: Few studies explored reference counts. Bettencourt and

Houston (2001) examined the relationship between article method type and subject area

against the diversity of the references. Gipp, Meuschke, and Breitinger (2014) used

Citation-based Plagiarism Detection (CbPD) algorithms to detect plagiarism. They

explain:

CbPD algorithms consider citation proximity, overlap, order, frequency, and

distinctiveness to varying degrees to cover the possible citation pattern

rearrangements that can occur for different plagiarism forms. (p. 1530)

Their research demonstrated that this form of plagiarism detection was “computationally

more efficient than character-based approaches” (p. 1527). This technique raises

the question: Can the reference count predict plagiarism?
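One of the citation patterns named in the quotation, citation order, can be illustrated by finding the longest sequence of citations that two documents share in the same order. The following is a simplified sketch of that single idea using a standard longest-common-subsequence routine. It is not the implementation of Gipp, Meuschke, and Breitinger (2014), and the function name and the citation keys in the example are hypothetical.

```python
def longest_common_citation_sequence(citations_a, citations_b):
    """Return the longest run of citations appearing in the same order in both lists."""
    rows, cols = len(citations_a) + 1, len(citations_b) + 1
    # table[i][j] = length of the longest common subsequence of the first i and j items
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if citations_a[i - 1] == citations_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover one longest shared sequence
    shared, i, j = [], len(citations_a), len(citations_b)
    while i > 0 and j > 0:
        if citations_a[i - 1] == citations_b[j - 1]:
            shared.append(citations_a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return shared[::-1]


# Hypothetical citation keys in order of appearance in two documents
doc_a = ["Sun2013", "Ison2012", "Youmans2011", "Keck2006"]
doc_b = ["Ison2012", "Sun2013", "Youmans2011", "Keck2006"]
print(longest_common_citation_sequence(doc_a, doc_b))
```

A long shared sequence relative to the total number of citations would flag a pair of documents for closer manual inspection. It would not, by itself, establish plagiarism.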

Reviewing Plagiarism Software

While it is beyond the scope of this study to test and compare available

plagiarism software packages, it was important to at least acknowledge what is

available. I started with a research report that Technavio authored. Technavio identifies

itself as "a leading global technology research and advisory company" (2016,

n.p.). Using market penetration formulas, Technavio compiled a list of the top eight

plagiarism detection software vendors (2016). Following are the synopses that

Technavio provided as well as any observational notes I have added.

Academicplagiarism: "Is an online software solution for individuals and

educational institutions to help them detect plagiarism. It offers editing and proofreading

services by academic experts and educators. It helps check web pages, books and

magazines, academic publications, and large databases of papers" (Technavio, 2016,

n.p.). I reviewed the Academicplagiarism website (https://academicplagiarism.com/) and

found that it provided plagiarism detection web services at all levels, from free,

limited versions to full-featured services. The posted interface resembles that of the

Turnitin COA report.

Grammarly: "Through its renowned product offering the Grammarly Editor helps

identify spelling and grammatical errors on the spot and rectifies them. With its writing

app, it ensures students that their content is easy to read, effective, and error free

through checks for contextual spelling mistakes, over 250 common grammar errors, and

vocabulary use. Its products are compatible with MS Office, plagiarism checker, and

native apps" (Technavio, 2016, n.p.). I tested the premium edition of Grammarly with the

plagiarism checking option. I was able to check PDF or Word documents for document

similarities. Mixed with its grammar checking capabilities, this add-in application

provided a document DSI. It also underlined individual content similarities and

provided a source link for further investigation. However, I found that as one excludes false

positives, Grammarly does not update the DSI or provide a separate aDSI. Grammarly’s

plagiarism detection option was designed for the author. Teachers and editors may find

that using Grammarly on a large scale is too demanding of their time.

PlagScan: "Is a browser based web service that checks the authenticity of

documents and detects plagiarism. It is compatible with all common file formats, such

as MS Word or PDF. The company offers PlagScan Pro with varied pricing options for

schools, universities, and faculty in the education sector" (Technavio, 2016, n.p.).

PlagScan's website (https://www.plagscan.com/) identified free trial packages for private,

organization, and enterprise IT users.

BlackBoard: "Provides enterprise technology and solutions to the educational

industry globally. The company was founded in 1997 and is headquartered in

Washington, DC, US. Its offices are located in Europe, North America, Asia, and

Australia. The company offers a variety of solutions through latest technologies for

government, further education, business, higher education, and K12 schools. It serves

over 19,000 clients in more than 100 countries globally, which include 1,900

international institutions. It is present in the market by offering SafeAssign, an

antiplagiarism software for institutions" (Technavio, 2016, n.p.). It should be noted that

Turnitin has been integrated into Blackboard Learn for both student and instructor use.

PlagiarismDetect: "Caters to the corporate sector, education customers, and

individuals. The company offers standard and premium pricing plans to all its customers

along with free trials. It has a wide geographical presence that includes countries such

as the US, India, France, the UK, Canada, and New Zealand. The software offered can

check for documents in two languages, namely English and Spanish" (Technavio, 2016,

n.p.). The website http://plagiarismdetect.org/ provides an overview of the COA report

and instructions on how to remove false positives. The web service offers a basic COA

report and a Premium COA report that includes multilevel source analysis. It appears the

charges are based on a per-page submission.

EVE Plagiarism Detection System: "Provides antiplagiarism software to

education customers. The company was founded in 1997 and is headquartered in Eden

Prairie, Minnesota, US. The software, known as EVE2, assists faculty to check student

work against internet sources. It has a strong presence in the North American market"

(Technavio, 2016, n.p.). While there are many Internet references regarding EVE and

EVE2 marketed by Canexus, www.canexus.com is no longer accessible.

PlagTracker: "Caters to plagiarism detection needs of students, teachers,

publishers, and site owners. PlagTracker checks for the content that follows American

Psychological Association (APA), Modern Language Association (MLA), and Chicago

style of citations. Benefits offered to teachers include student management, custom

filter, grading system, live document view, document cross check, and tracking system"

(Technavio, 2016, n.p.). The website www.plagtracker.com provides account

management and COA services.

Turnitin: "Provides instructors with tools to engage students in the writing

process, provides personalized feedback, and assesses student progress over time. It

is used by more than 26 million students at 15,000 institutions in 140 countries"

(Technavio, 2016, n.p.). Given that this study uses Turnitin, the following is a further

brief review of the academic literature regarding Turnitin.

Heather (2010) quoted a JISC-PAS study that concluded that “Turnitin is the

global leader in electronic plagiarism detection, is a tried and trusted system and over

80% of UK universities have adopted it” (p. 3). Hill and Page (2009) reported that “the

results of our brief study indicate that Turnitin is the more accurate platform based on its

higher successful detection rate and the lower false detection rate” (p. 177). Turnitin

(2011) emphasized the importance of its service, stating:

Educators who employ the proper tools and technologies can significantly

mitigate plagiarism. For example, institutions with the widespread adoption of

Turnitin see a reduction in the unoriginal content of 30 to 35 percent in the first

year. By the fourth year, many institutions see levels of unoriginality fall by up to

70 percent. (p. 3)

However, Marshall, Taylor, Hothersall, and Pérez-Martín (2011) reaffirmed what

was described by Barrett and Malcolm (2005), Sutherland-Smith and Carr (2005), and

Jones (2008) as they concluded “that the originality report generated by Turnitin can

only serve as a reference point and it takes human scrutiny to examine each incidence

of text-matching” (p. 267).

METHODOLOGY

Research Design Overview

I engineered this corpora plagiarism study for examining published HRD-related

dissertations and published articles using Turnitin. This study's processes are dissimilar

from the examination of unpublished dissertations or articles. Unpublished dissertations

or articles do not carry all of the similarity 'baggage' that published works bear. That

being said, the methodologies described in this section provide basic descriptions of the

data stores (databases or spreadsheets), processes (manual or automated can be

used), and deliverables (resultant data for analysis). Figure 2 illustrates a summative

visual representation of this study including the three database objects and the four

main processes: building the corpus, collecting data on the corpus, analyzing the data,

and reporting the results. All of these processes include task objects where applicable.

I assembled and examined a balanced set of corpora from sampled dissertations

and academic publications. I extracted these documents from populations of

dissertations and academic publications representative of the broad human resource

development (HRD) field. Swanson (1995) provided a definition of HRD as “a process of

developing and unleashing human expertise through organizational development and

personnel training and development for the purpose of improving performance” (p. 208).

Moreover, Thumwimon and Takahashi (2010) outlined HRD’s three fundamental

component areas as “individual development (personal), career development

(professional), and organizational development” (p.11). Using these two HRD

definitions, I compiled a list of search filtering criteria: “human resource development”

or “organizational development” or “career development” or “training and development”

or “HRD.”

Figure 2. Corpus plagiarism study process chart. This figure illustrates the main processes and objects used in this study.

I used Turnitin to examine the individual documents for content originality or, as it is

more commonly referred to in the negative, content similarity. Turnitin is a web-based

software service, founded in the 1990s by John Barrie, which markets itself as a

solution for identifying evidence of plagiarism (Klienfield, 2014). However, Turnitin is a

content similarity analysis (CSA) product that identifies and measures content

similarities between a submitted document and the Turnitin collection of published

documents by calculating source similarity indexes (SSI). An SSI is the identified

similarity expressed as a percentage of the submitted document that has been identified

in another document located in the Turnitin databases.

Turnitin’s quantitative factors: The DSI and supporting SSIs cannot be taken at

face value and “manual checking and human judgment are still needed” (James,

McInnis, & Devlin, 2002, p. 1). A thorough content similarity analysis requires manual

analysis of the identified content similarities by the researchers (Thomas & de Bruin,

2015). Turnitin has an exclusion process for eliminating false positives. An investigator,

upon determining the SSI is not potential plagiarism, can remove (exclude) that source

from the COA report. I examined every SSI equal to or greater than 1% for potential

exclusion. While it is not practical to examine the remaining less than 1% SSIs, this

study addressed their inclusion and exclusion using a prorating formula as described in

the study’s methodology section.
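The 1% screening rule described above amounts to splitting the detected sources into those that receive manual examination and those that do not. The following is a minimal sketch of that triage step only. The function name, the source labels, and the SSI values are hypothetical, and the prorating formula applied to the remaining sources is not reproduced here.

```python
def split_sources_for_review(ssi_by_source, threshold=1.0):
    """Separate sources whose SSI (in %) meets the threshold from those below it."""
    to_review = {s: v for s, v in ssi_by_source.items() if v >= threshold}
    below_threshold = {s: v for s, v in ssi_by_source.items() if v < threshold}
    return to_review, below_threshold


# Hypothetical SSI values (in %) for one submitted document
ssi_values = {
    "source_a": 4.0,
    "source_b": 1.0,
    "source_c": 0.6,
    "source_d": 0.2,
}
to_review, below_threshold = split_sources_for_review(ssi_values)
print("Examine manually:", sorted(to_review))
print("Below 1%:", sorted(below_threshold))
```

Whether a source in the first group is then excluded remains a judgment made by the investigator from the available evidence.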

Turnitin provides two alternative ways of examining and excluding SSIs using

its Match Overview and All Sources options. The Match Overview lists the SSIs

currently used in building the DSI. The All Sources option (see Figure 3) exhibits all

detected sources where Turnitin identified content similarities, including the individual

overlapping documents that contain the similarities.

The exclusion option can be executed from both screens. However, excluding an

identified source from the originality report is a subjective researcher decision based on

the available evidence.

Population/Sample

The validity of this study was highly dependent upon the integrity of the corpus of

sampled dissertations and published articles. The dissertations came from the ProQuest

Dissertation and Thesis Publishing database. I chose the ProQuest Dissertation and Thesis

Publishing database because ProQuest headed the list of googled results from

“dissertation database” and marketed itself as a dissertation repository independent

from a university system.

Figure 3. Turnitin SSI interfaces for selecting SSI for examination and exclusion.

As previously mentioned, I developed the search criteria using the HRD definitions of

Swanson (1995) and Thumwimon and Takahashi (2010). The search terms were

“human resource development” or “organizational development” or “career

development” or “training and development” or “HRD.” By enclosing each word

combination in quotes, the results yielded a more focused corpus. For example,

“training” alone returned numerous sports-preparation articles, while “training and

development” tended to focus more on HRD articles.

The search criteria targeted the dissertations’ searchable database fields (title,

keywords, and abstract) within a publishing window of 2011 to 2015. To meet the scope

of this study, I applied filters for “Full-text” and “Doctoral dissertation only.” This

ProQuest Dissertation and Thesis Publishing database query yielded a population of

910 dissertations. According to Howell (2010), a population can be any size as long as it

represents the interests of what a researcher is pursuing.

Using the Krejcie and Morgan (1970) method for determining the sample size

from a given population of 910 documents, I calculated a required minimum sample size

of 270 dissertations or about 30% of the population. I gave each document in the

population a random chance of inclusion to improve the validity of the sample

and protect the anonymity of the dissertations, authors, and institutions. I used a

software application I engineered to randomly sample the population of 910 with a

sample size of 270 (see Figure 4). According to Moore and McCabe (1989), a

random sample adds to the external validity of a study when each member of the

population has an equal chance for inclusion in the sample.
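As a rough check on the minimum sample size reported above, the Krejcie and Morgan (1970) formula can be sketched in Python. This is a minimal sketch and not the software used in the study: the function name is hypothetical, and the default constants (a chi-square value of 3.841 for one degree of freedom at the .05 level, a population proportion of .5, and a 5% margin of error) are assumed to be the conventional values behind the published table.

```python
def krejcie_morgan_sample_size(population, chi_sq=3.841, p=0.5, margin=0.05):
    """Return the minimum sample size for a finite population.

    Formula: s = X^2 * N * P * (1 - P) / (d^2 * (N - 1) + X^2 * P * (1 - P)).
    The default constants are assumptions (conventional table values).
    """
    numerator = chi_sq * population * p * (1 - p)
    denominator = margin ** 2 * (population - 1) + chi_sq * p * (1 - p)
    return numerator / denominator


if __name__ == "__main__":
    n = krejcie_morgan_sample_size(910)
    print(round(n))            # minimum sample for the 910 dissertations
    print(round(n / 910, 2))   # share of the population that sample represents
```

Under these assumed constants the formula yields a value that rounds to 270 for a population of 910, which agrees with the figure reported here. Values read from the published lookup table can differ slightly from the formula because the table lists only selected population sizes.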

The published articles came from the EBSCO collection of document databases

(EBSCO, 2016a). I examined several academic article databases, including Web of

Science. However, I chose EBSCO because of broad acceptance by colleges and

universities and its ability to export only articles where the publication text was available

for download. I used the same search criteria as the ProQuest query, “human resource

development” or “organizational development” or “career development” or “training and

development” or “HRD” with a publishing window of 2011 to 2015. I employed EBSCO

filters that demonstrated some equivalency to the ProQuest filters: Limit To options

“available text,” “references available,” and “peer reviewing”; Source Type options

“Academic Journals”; Database options “All Databases.” EBSCO specifies,

The default fields that are searched … include all authors, all subjects, all

keywords, all title information (including source title) and all abstracts. If an

abstract is not available, the first 1,500 characters of the HTML full text of the

article are searched. (2016b, n.p.)

The EBSCO query criteria yielded a population of 5,336 documents. The

Krejcie and Morgan (1970) calculations for determining a minimum sample size from a

given population of 5,336 produced a minimum sample size of 360 or about 7% of the

documents. The sample of articles was built by subjecting each document in the

population to a “random assignment” process (Howell, 2010, p. 3). Again, I used the

random sample generator application, set to a sample size of 360

(see Figure 4).
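The random selection described above can be illustrated with a short Python sketch. It is not the random sampler application shown in Figure 4, whose internals are not described here; the function name, the sequential document identifiers, and the seed are hypothetical. The sketch draws a simple random sample without replacement, so every document has an equal chance of inclusion.

```python
import random


def draw_sample(document_ids, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    A fixed seed (an assumption made for reproducibility) lets the same
    sample be regenerated later.
    """
    rng = random.Random(seed)
    return rng.sample(list(document_ids), sample_size)


if __name__ == "__main__":
    population = range(1, 5337)          # hypothetical IDs for 5,336 articles
    sample = draw_sample(population, 360, seed=2017)
    print(len(sample), len(set(sample)))  # sample size and number of unique IDs
```

Sampling without replacement guarantees that no document is selected twice, which matters when each sampled document is later submitted once to Turnitin.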

Because this study considered both dissertations and published articles, I raised

the dissertations sample count beyond the minimum of 270 to 360 randomly selected

dissertations (about 40% of the 910 dissertations). Thus, the corpus now contained a

minimum of 720 documents for a “balanced design” between dissertations and

published articles (cf. Howell, 2010, p. 332), while still meeting the minimum sample

sizes (Krejcie & Morgan, 1970). Scholarly Research Tracker assigned all documents in

the corpus a unique document identification number.

To address corpus diversity concerns, I created reporting programs using the

SRT (Mayes, 2016) database to identify the frequencies of author submissions by

parsing the author lists. The corpus of 360 dissertations had 360 different authors. The

corpus of 360 published articles had 854 authors with co-authorships ranging from 1

through 18. If the corpus were concentrated across a relatively limited number of

authors, researchers could call the validity of this study into question. Out of 852 listed

article authors, I credited only 24 with two publications while one author was credited

with three publications. These documents with duplicate authorships remained in the

study.

Figure 4. Random sampler software. This figure displays the interface used to manage the automated random sampling process. © Mayes 2017
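The author frequency check described above can be sketched as follows. This is an illustration only and not the SRT reporting program; the function names and the author names in the usage example are hypothetical.

```python
from collections import Counter


def author_publication_counts(author_lists):
    """Count how many documents each author appears on."""
    counts = Counter()
    for authors in author_lists:
        # A set guards against the same name being listed twice on one document.
        counts.update({name.strip() for name in authors})
    return counts


def authors_by_frequency(counts):
    """Map 'number of documents' to 'number of authors with that many documents'."""
    return Counter(counts.values())


if __name__ == "__main__":
    corpus = [
        ["Author A", "Author B"],
        ["Author B"],
        ["Author C", "Author A", "Author D"],
        ["Author A"],
    ]
    counts = author_publication_counts(corpus)
    print(authors_by_frequency(counts))
```

A frequency table dominated by authors with a single document, as reported for this corpus, suggests that no small group of authors drives the results.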

I then identified the academic institutions for the dissertations and academic

journals for the published articles. A compilation of institutions and publishers, including

contribution frequencies, was used to confirm a broad and multidisciplinary nature. The

corpus of 360 dissertations represented 156 institutions. The corpus of

360 published articles represented 205 publishers. These data substantiate

the assumption of diversity and are in alignment with what Rhodes, Gelman, and

Brickman (2008) stated: “evidence obtained from more diverse sources more strongly

support a conclusion than evidence obtained from more homogeneous sources”

(p. 114).

Data Collection Process 1

To control the internal validity of this study, I carefully documented and executed

the data collection process. This study collected data from two collection processes for

each of the two corpora (dissertation and published articles). The first collection process

was from the initial document retrieval and review. During the initial document review, if

I deemed the document inappropriate, I rejected the document from the sample, and

chose the next available document from the export list, thus maintaining the

randomness of selection. A document was rejected only if it:

Could not be retrieved

Was not a dissertation or a published academic article

Was incomplete or unreadable by Turnitin or me

I classified each of the 720 sampled documents as a dissertation or a published

article. I also reviewed the methodology section and looked for characteristics that

identified the document research design as an empirical study and the methodology

employed. I recorded the document research method using guidelines provided by

Creswell (2014) as listed below:

Empirical quantitative: If the document was defined as quantitative or there was

evidence of statistical results or calculations, the quantitative research method

was selected

Empirical qualitative: If the document was defined as qualitative or there was

evidence of content analysis, purposeful sampling, interviews, or observations, the

qualitative research method was selected

Empirical mixed: If the document was defined as mixed or there was evidence of

quantitative and qualitative research, the mixed research method was selected

Other: If there was no evidence of quantitative or qualitative research, then other

was selected

Moreover, I recorded year of publication, author count (author count for

dissertations was consistently one), word count, and reference count. Wilmington

(2013) provided a sample Adobe Acrobat JavaScript, which I modified to retrieve the

word counts from the collected PDF documents (see Figure 5).

However, I found one instance where this script returned an invalid word count

on a dissertation. A count of 6,308 words on a 108-page dissertation appeared to be related to

the document's use of an Acrobat encryption character set. I used the Microsoft Word

word-count feature, which returned 27,361 words. While recognizing that this

anomaly must be an Adobe Acrobat program issue, the immediate concern was whether

Turnitin had been able to deliver a valid COA report. An examination of the COA report

indicated that Turnitin had produced a valid report.

Data Collection Process 2

The second data collection process came from the Turnitin submissions of the

study’s corpus. Before any Turnitin submission, I configured and verified settings to

exclude bibliography and quoted materials. These configured settings are in line with

what the Higher Education Commission, Pakistan (n.d.) promoted when instructing

many of the middle-eastern universities using Turnitin.

References/bibliography and table of contents must be removed from a

document which is submitted. If these are included the similarity index of

the document will be increased. (p. 3)

Figure 5. Adobe Acrobat JavaScript. This figure is the source code used in Adobe Acrobat to count and display the number of words and pages in the viewed document.

Additionally, the Turnitin settings used in this study were set to exclude content

similarities of less than ten words in sequence. This word exclusion feature is an

attempt to eliminate what Turnitin refers to as “trivial similarities” (iParadigms, 2011).

The Turnitin default setting for word exclusion is three words. Sun (2013) used a 30-word

exclusion setting. I tested all three of these word exclusion settings (3, 10, and 30)

and found the 10-word setting did eliminate additional trivial and template similarities.

However, I found that a 30-word exclusion affected the Turnitin DSI in some cases by

over 50%, bringing into question its validity. Because the corpus consisted of previously

published works, I manually removed publisher copyright pages and warnings. I found

that such common content, shared by hundreds if not thousands of published documents,

created havoc with the content similarity verification process. One example, shown

in Figure 6, was the issue of common dissertation template text. The phrase “in partial

fulfillment of the requirements for the degree of Doctor of Philosophy” was common

across thousands of documents in the Turnitin document collection database.

A dissertation’s table of contents and subheadings also created template

similarities. There were thousands of SSIs that failed to provide evidence of

plagiarism. Moreover, these types of template similarities were difficult to exclude in

both dissertations and published articles.

After completing the document preparation steps, I submitted each document

using its document identifier to Turnitin for content similarity analysis to detect and mark

potential evidence of plagiarism (iParadigms, 2011). The Turnitin submission process

generated a “content originality report” that provided quantitative data and a visual

highlighting of the areas within the submitted document that were similar to other

documents (see Appendix H).

Figure 6. An example of a dissertation template.

Moving on to the direct human intervention part of a plagiarism analysis, I found

no fully documented steps in any of the previous studies for detecting false positives.

Moreover, Turnitin did not provide any particular documentation for evaluating content

similarities. iParadigms (2011) in its training manual states:

The decision to deem any work plagiarized must be made carefully, and only

after in-depth examination of both the submitted paper and suspect sources in

accordance with the standards of the class and institution where the paper was

submitted. (p. 50)

Thus, an important part of this study was to identify and document the process

used to evaluate the validity of the content originality report objectively. However, as the

Higher Education Commission (n.d.) stated, in searching for evidence of plagiarism

an investigator must consider subjectively that “the benefit of the doubt may be given to

the author” (p. 3).

I identified two major processes that are required to adjust the DSI for corpus

research. I emphasize that checking student and author original work would require a

different set of processes and tasks that are outside the scope of this research.

Step 1: The DSI was the summation of the source similarity indices (SSI). The

SSI identified each specific source where Turnitin found a textual similarity. Turnitin

displayed the SSIs in two different formats: Match Overview and All Sources (see

Figure 3). The Match Overview shows the similarities that Turnitin has assigned to that

particular listed SSI. One can exclude SSI from the Match Overview by selecting the right

arrow for a breakdown of all related SSI that have text in common. The All Sources view

allows a researcher to view all of the sources that have similarities, even when they

contain duplicated similarities. An investigator can exclude sources from either view. I

excluded sources from the report when I suspected a false positive based upon the

preponderance of the evidence. However, often when I eliminated a source the Turnitin

statistics unexpectedly remained as they were. For instance, eliminating a source rarely

reduced the DSI by the amount attributed to the excluded SSI. That was because other

sources had the same or parts of the same similarities. This anomaly became a serious

issue when the repeated parts of a publisher’s template were common across hundreds

of articles. I used the guidelines listed below when examining SSI for a determination as

to whether to exclude them from the Turnitin reported DSI.

When an SSI source document in the Turnitin document collection database was

unavailable for inspection and the examination of the SSI content stream was

inconclusive, I excluded that SSI to prevent the false identification of content similarities

as evidence of plagiarism (cf. Stevens, 2009). Listed below are the

Step 1 tasks for detecting false positives (similarities that are not evidence of plagiarism), applied while the DSI

remained equal to or greater than 15% or while any SSI was equal to or greater than 1%.

1. Record the starting Turnitin DSI.

2. Select “All Sources” with available sub-sources.

3. Starting with the largest unexamined SSI percentage, examine the source

document or DCD text stream for identified similarities. Exclude duplicate

postings of the same document and marketing references to the title and

abstract (author list and date of publication are essential for a concise

evaluation).

4. If needed, examine the sub-sources for further clarification.

5. Exclude the similarity source if there is no evidence that the similarities are

plagiarism, including trivial and template similarities.

6. If the exclusion does not reduce the DSI to a level below 15% or there are

SSI of 1% or greater, repeat Steps 3 through 6.

7. When finished with the exclusion process, record the current DSI (the

beginning DSI minus exclusions); then, from the Match Overview interface,

sum the SSI greater than or equal to 1% and record as the aDSI.

8. Proceed to the next Turnitin content originality report.

9. Repeat Steps 1 through 7 until corpus examination is complete.

Step 2: While Turnitin cannot identify the types of plagiarism, this study outlines a

method of identifying plagiarism-of-others as opposed to plagiarism-of-self by examining

all of the SSI equal to or greater than 5% that have not been previously excluded. I

examined SSI for the following two conditions:

Plagiarism-of-Others: If the author(s) were different and the source document

had an earlier publishing date, the SSI was identified as plagiarism-of-others.

Plagiarism-of-Self: The SSI was considered evidence of plagiarism-of-self

if any author(s) were common to both documents and the source document:

• Was not the author’s dissertation, thesis or other student’s papers

• Was not a non-public published conference proceedings

• Was not a grant project document or presentation

• Was not a duplicate web posting of the same document or abstracts

• Did not contain a note that the submission is a duplicate or variation of the

original manuscript with appropriate publisher permissions or releases
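The two conditions above can be expressed as a small decision function. The following Python sketch is an illustration under stated assumptions and not the procedure as implemented in the study, where the determination was made by the researcher: the class, the function, the source-kind labels, and the "undetermined" outcome are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical labels for the source types that the criteria above exempt
# from plagiarism-of-self.
EXEMPT_SOURCE_KINDS = {
    "student_work",            # author's dissertation, thesis, or student paper
    "nonpublic_proceedings",   # non-public conference proceedings
    "grant_document",          # grant project document or presentation
    "duplicate_posting",       # duplicate web posting or abstract
    "permitted_duplicate",     # noted duplicate with publisher permission
}


@dataclass
class Document:
    authors: frozenset
    published: date
    kind: str = "published_article"


def classify_ssi(submitted, source):
    """Classify one non-excluded SSI using the two conditions in the text."""
    if submitted.authors & source.authors:       # at least one shared author
        if source.kind in EXEMPT_SOURCE_KINDS:
            return "not plagiarism-of-self"
        return "plagiarism-of-self"
    if source.published < submitted.published:   # different authors, earlier source
        return "plagiarism-of-others"
    return "undetermined"                        # left to researcher judgment
```

In use, each SSI of 5% or greater that survived the exclusion process would be passed through such a function, with the source kind supplied by the researcher after inspecting the source document.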

This plagiarism type identification could be integrated into the

previous step when identifying exclusions. However, I chose to perform Step 2 following

the completion of Step 1. By separating the process into two steps, I occasionally found

and corrected errors in the previous results from Step 1, thus increasing the internal

validity of the study. See Figure 7 for a visual representation of the

processes described in Step 1 and Step 2. The data collected from Step 1 and Step 2 were

entered into a researcher-developed SRT interface (see Figure 8).

Figure 7. Flow diagram for verification of Turnitin content originality report.

Data Export Process

After collecting all relevant data, SRT created several CSV export files adding

several derived or synthetic variables essential to the analysis. Because of the wide

range of the word counts, it became apparent that 10% of a 100,000-word document

was not the same as 10% of a 10,000-word document. While the scale of choice for

Turnitin is solely based upon percentages, this study also included the word counts

associated with these percentages. During the SRT data export process, the SRT

application calculated the document similarity words (DSW) and adjusted document similarity words (aDSW).
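The conversion from a similarity percentage to a word count can be sketched in a few lines of Python. This is a minimal illustration, not the SRT export code; the function name is hypothetical, and rounding to the nearest whole word is an assumption.

```python
def similarity_words(similarity_index, word_count):
    """Convert a similarity index (as a fraction, e.g. 0.10) to whole words.

    Rounding to the nearest whole word is an assumption of this sketch.
    """
    return round(similarity_index * word_count)


if __name__ == "__main__":
    print(similarity_words(0.10, 100_000))  # long dissertation
    print(similarity_words(0.10, 10_000))   # typical article
```

The same 10% index corresponds to ten times as many similar words in the longer document, which is why the word-based variables were carried alongside the percentages.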

Figure 8. SRT Turnitin COA report data collection interface.

While Turnitin does not provide a method for summarizing the content similarities

that were identified as less than 1%, SRT estimated their contribution to the aDSI by

subtracting the sum of the remaining SSI equal to or greater than 1% from the current

document similarity index (the starting DSI less any exclusions). Moreover, based upon

the percentage of posted exclusions, SRT interpolated how much of the SSIs less than

1% would be legitimate similarities and a part of the aDSI. Below is the formula I used in

SRT on each document (d) with an example that illustrates how it was applied:

\[ aDSI_d = \sum SSI(\geq 1)_d + \left( \frac{eDSI_d}{DSI_d} \times \sum SSI(<1)_d \right) \]

For example, document (d) has a Turnitin reported DSI_d value of 45%. After the

investigator had excluded several SSI_d believed not to be evidence of plagiarism, the

DSI_d was reduced to a new value, eDSI_d, of 10%. That produced a ratio of 10/45, or 22%,

of the summation of all of the Turnitin reported SSI as evidence of plagiarism. However,

if the investigator sums the remaining SSI(≥1)_d equal to or larger than 1% and

finds they add up to 6%, the remaining 4% is the sum of the SSI(<1)_d smaller than 1%.

The ratio of 10/45, or 22%, was applied to the 4%, producing 0.88%, which was then added to

the 6%. This document then had an estimated 6.88%, rounded up to 7%, of its content

as similarities that were evidence of plagiarism. I used a spreadsheet formula to double-check

the accuracy of the automated SRT formula (see Figure 9).
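The prorating calculation in the worked example can be reproduced with a short Python function. This is a sketch of the arithmetic as described in the text, not the SRT code or the spreadsheet in Figure 9; the function and parameter names are hypothetical, and all inputs are percentages.

```python
def adjusted_dsi(dsi, edsi, ssi_ge_1_sum):
    """Estimate the aDSI for one document.

    dsi          -- Turnitin DSI before any exclusions
    edsi         -- DSI remaining after the researcher's exclusions
    ssi_ge_1_sum -- sum of the remaining SSIs that are 1% or greater
    """
    if dsi <= 0:
        return 0.0
    ssi_lt_1_sum = edsi - ssi_ge_1_sum   # what is left belongs to SSIs below 1%
    ratio = edsi / dsi                   # share of the DSI that survived exclusion
    return ssi_ge_1_sum + ratio * ssi_lt_1_sum


if __name__ == "__main__":
    value = adjusted_dsi(dsi=45, edsi=10, ssi_ge_1_sum=6)
    print(round(value, 2), round(value))
```

With the unrounded ratio of 10/45 the result is about 6.89% rather than the 6.88% obtained when the ratio is first rounded to 22%; both round to 7%.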

I used the SRT software (Mayes, 2016) to export the data collected into CSV files

for importation into IBM’s Statistical Package for the Social Sciences (SPSS) Version

20. During the export process, I used an SRT automated process that used the aDSI

values to assign DSL membership values. I addressed each research question using

SPSS for the analysis and RStudio for graphs and charts. I saved both the data files and

the syntax files for further reference.

Data Collection Summary

In review, the Turnitin DSI was the most important or visible variable provided by

Turnitin. However, the primary quantitative dependent variable in this study was the

adjusted document similarity index (aDSI). The aDSI was the sum of the remaining non-

excluded SSI equal to or greater than 1% plus a proportional amount of the remaining

SSI that are less than 1% as previously described.

I used the aDSI to generate the categorical document similarity level (DSL)

values. The DSL was categorically modeled using three membership levels of aDSI

based upon previous research and instructional documents (see Table 2). Using

ordinal categories, I denoted the first membership level as acceptably low or moderate

aDSI with a range 0% to 14%. The second membership level was denoted as high

using aDSI with a range of 15% to 24%. The last membership level was denoted as

excessive using aDSI with a range equal to or greater than 25%.
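The three membership levels can be written as a simple threshold function. This Python sketch is illustrative only; the function name is hypothetical, and treating 15% and 25% as the lower bounds of the high and excessive levels for fractional aDSI values is an assumption, since the ranges in the text are stated in whole percentages.

```python
def document_similarity_level(adsi_percent):
    """Assign the ordinal DSL category from an aDSI given in percent."""
    if adsi_percent >= 25:
        return "excessive"
    if adsi_percent >= 15:
        return "high"
    return "low"


if __name__ == "__main__":
    for adsi in (9, 14, 15, 24, 25, 40):
        print(adsi, document_similarity_level(adsi))
```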

Figure 9. Spreadsheet formulas for double-checking SRT-DSI final adjustments.

Data Analysis

To reiterate, this study analyzed data from two corpora: a dissertation

corpus and a published article corpus. While the dissertation corpus and the published

article corpus were from the HRD field, they had many distinct differences. First,

dissertations always had one author, while this sample of published articles had up to

eighteen authors. While dissertations primarily employed empirical research methods,

published articles included editorials, theory development, and other areas of

professional interests. Moreover, it was common for dissertations to have over 100,000

words while published articles rarely exceeded 10,000 words. With these large

differences, statistical significance in corpus comparisons could be difficult to interpret.

I addressed both RQ1.1 and RQ1.2 through SPSS and R descriptive statistics

that included distributions, measures of central tendencies, frequencies, and Pearson r

correlational statistics that identified relationships between the variables: author, words,

and reference counts (Bedeian, 2014). Next, I examined the data for outlying conditions

and abnormalities that may indicate inappropriate sampling, or a problem as simple as a

mistake in entering or coding data (Mullen, Milne, & Doney, 1995). However, as

Grace-Martin (2016) states: “It is NOT acceptable to drop an observation just because it is an

outlier. They can be legitimate observations and are sometimes the most interesting

ones” (n.p.). Retaining outliers is certainly prudent in plagiarism studies.

For the continuous variables DSI, DSW, aDSI, and aDSW, I used the

descriptive statistics features of SPSS to compute the means, standard deviations, and ranges for the two

corpora of dissertations and published articles and for various subpopulations. For the

continuous variables author count, word count, and reference count, I again used the

descriptive statistics features to compute the means and standard deviations. For the polychotomous

variable research method (quantitative, qualitative, and other) and for the DSL

dependent variable, I conducted frequency and subpopulation descriptive statistics.

Both RQ2.1 and RQ2.2 were important research questions. Researchers

need to know whether the Turnitin-provided DSI values are credible or whether they differ significantly

from researcher-generated aDSI values. Researchers can expend a significant

amount of resources to calculate aDSI values. If the difference is statistically insignificant,

that investigation process may be unimportant.

I used a Wilcoxon signed-rank statistic to analyze the statistical difference, or

lack thereof, between the two dependent variables. Researchers often use the paired t-test

statistic for this type of statistical analysis, but the paired t-test has the assumption of

normality on the difference between the two dependent variables. However, our pilot

study indicated that the data difference histogram was bimodal and heavily skewed. We

confirmed this non-normality using the Shapiro-Wilk test for normality. Both the

dissertations and published articles have fewer than 2000 elements, so I chose the

Shapiro-Wilk, over the Kolmogorov-Smirnov test for normality. For either test, if the

results showed a statistical significance (p < .05) the data would not meet the

assumption of normality on the difference between the two dependent variables.

The Wilcoxon signed-rank statistic with α = .05 was substituted for a t-test

analysis. The Wilcoxon signed-rank statistic ranks the absolute values of the paired

differences in order of size and then compares the sums of the ranks for the positive and negative

differences. Another difference is that the Wilcoxon signed-rank statistic is dependent

upon the median value of the ranked distribution. Box plots are normally used with

the Wilcoxon signed-rank statistic (Massart, Smeyers-Verbeke, Caprona & Schlesierb,

2005). According to Zimmerman (1996):

The Wilcoxon test is more powerful for various non-normal distributions with

excess skewness and kurtosis, including mixed-normal, exponential, Cauchy,

and Laplace distributions (Hodges & Lehmann, 1956; Randles & Wolfe, 1979).

The comparison does not imply that violation of the normality assumption has no

influence on the Wilcoxon test. On the contrary, the power of the test to detect

alternatives declines, despite maintenance of the significance level. But the

power of the t-test declines even more, so the nonparametric method acquires an

advantage. (p.29)

The Wilcoxon signed-rank effect size (r) is widely used and understood

(Rosnow & Rosenthal, 2005). It is based

upon the calculated Z score divided by the square root of the number of data elements

(Field, 2011, pp. 26 & 558). The key to the symbols: Z = Z score, X = observed value, \bar{X} = mean of difference

between the conditions, s = standard deviation, N = number of cases.

\[ z = \frac{X - \bar{X}}{s} \]

\[ r = \frac{Z}{\sqrt{N}} \]

As with the t-test, the Wilcoxon signed-rank effect size (r) is interpreted

much like a Cohen’s d. Olive and Franco (2008), citing Cohen (1988), suggested

that a Cohen’s d = 0.2 is a small effect size, d = 0.5 is a medium effect size, and d =

0.8 is a large effect size. Rosnow and Rosenthal (2005) further stated that a Wilcoxon signed-rank

effect size estimate r of 0.5 and above is large.
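To make the effect size calculation concrete, the following Python sketch computes a Wilcoxon signed-rank Z score by the large-sample normal approximation and then the effect size r as Z divided by the square root of N. It is an illustration under stated assumptions and not the SPSS procedure used in the study: zero differences are dropped, tied ranks receive average ranks, no tie or continuity correction is applied to the variance, and the function names and sample values are hypothetical. The choice of N is left to the caller.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_z(before, after):
    """Z score for the sum of positive ranks (normal approximation)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


def effect_size_r(z, n_obs):
    """Effect size r = Z / sqrt(N)."""
    return z / math.sqrt(n_obs)


if __name__ == "__main__":
    dsi = [0.30, 0.25, 0.40, 0.10, 0.50, 0.20]    # hypothetical DSI values
    adsi = [0.10, 0.08, 0.12, 0.09, 0.20, 0.05]   # hypothetical aDSI values
    z = wilcoxon_z(dsi, adsi)
    print(round(z, 4), round(effect_size_r(z, len(dsi)), 4))
```

A negative Z here indicates that the adjusted values are lower than the Turnitin-reported values; the absolute value of r is what is compared against the size guidelines cited above.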

reference count predict membership in low, high, or excessive levels of the plagiarism

Both RQ3.1 and RQ3.2 employed the ordinal DSL as the dependent or outcome

variable. However, while RQ3.1 employed year of publication, research method, word

count, and reference count as four independent or predictor variables, RQ3.2 also

included author count. I addressed the analysis of these variables with a multinomial

logistic regression (MLR) analysis, modeling the three levels of content similarity in what

was referred to as the dependent, respondent, or outcome variable. Multinomial logistic

regression (MLR) is not a member of the generalized linear models (GLM), but rather

uses a logarithmic transformation to change categorical data in a linear way (Bham,

Javvadi & Manepalli, 2012; Petrucci, 2009; Berry & Feldman, 1985). However, MLR’s

strength is that both continuous and categorical independent variables can be used to

predict membership in a categorical dependent variable (O’Connell & Rivet Amico,

2010). The multinomial dependent variable was chosen over a binomial dependent

variable because it was based upon the three categories of similarity levels as defined

in the merger of several standards from recommendations such as Thomas and Bruins

(2014), the Higher Education Commission, Pakistan, (n.d.) and others. The standards

used in the analysis for this study were low (0% to 14%), high (15% to 24%) and

excessive (equal to or greater than 25%).

I did consider predictive discriminant analysis (PDA), but because of the potential

mixture of continuous, dichotomous, and polychotomous independent variables, MLR

was the final choice (Petrucci, 2009). Moreover, MLR in SPSS is easier to use and

more robust than PDA (Burns & Burns, 2009) and uses odds ratios for measuring

predictability. Odds ratios for each independent variable look at the sample size and the

total number of the independent variable cases that succeeded in changing the

dependent variable (IDRE-UCLA, 2017A). Odds ratios are easier to understand and to

make comparisons. For example, MLR is often used to predict buying decisions such as

“Do women prefer SUV vehicles over sedans and trucks?” In such a multinomial logistic

regression, the dependent variable would be the purchase of an SUV, sedan, or truck,

and the independent variable would be the vehicle purchase decision maker, coded

male = 0 and female = 1. Based on the historical data, a multinomial logistic regression

could show that if a female made a purchase, she would be 6.5 times more likely to buy

an SUV over a sedan and 29.6 times more likely to buy an SUV over a pickup.
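The odds ratios in this kind of example can be illustrated directly from a table of counts. The Python sketch below uses hypothetical purchase counts (they are not the data behind the 6.5 and 29.6 figures in the text, apart from being chosen so that the first ratio equals 6.5), and the function name is hypothetical. With a single dichotomous predictor, the odds ratio estimated by a multinomial logistic regression should match this ratio of observed odds.

```python
def odds_ratio(counts, group, reference_group, outcome, reference_outcome):
    """Odds of `outcome` versus `reference_outcome` for `group`,
    relative to the same odds for `reference_group`."""
    odds_group = counts[group][outcome] / counts[group][reference_outcome]
    odds_reference = (counts[reference_group][outcome]
                      / counts[reference_group][reference_outcome])
    return odds_group / odds_reference


if __name__ == "__main__":
    # Hypothetical purchase counts by decision maker and vehicle type.
    purchases = {
        "female": {"suv": 65, "sedan": 10, "truck": 5},
        "male": {"suv": 10, "sedan": 10, "truck": 20},
    }
    print(odds_ratio(purchases, "female", "male", "suv", "sedan"))
    print(odds_ratio(purchases, "female", "male", "suv", "truck"))
```

An odds ratio above 1 means the outcome is more likely, relative to the reference outcome, for the named group than for the reference group.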

The collected data was examined for statistical assumptions as required for MLR

model analysis. First, each variable must have a single value per case. By design, I did

not have duplicate cases. Also, the MLR model assumes that no independent

variable can perfectly predict the dependent variable. While several of the

independent variables were categorical, others were continuous. Thus, I

had reason to test the assumption that the continuous variables were linearly related to the logit (Field, 2011).

Moreover, over-dispersion is a serious issue if the assumption of independence of

errors is not met (Field, 2011). Field (2011) stated that while independent variables do

not have to be statistically independent of each other, there is an assumption that

multicollinearity is relatively low in both linear and logistic regression. High levels of

predictor multicollinearity can affect the results. For example, if word count and

reference count are highly correlated, that may affect the logistic regression statistics.

I tested word counts and reference counts for multicollinearity. There were no

multicollinearity issues.
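One common way to screen two predictors for multicollinearity is to compute their Pearson correlation and the corresponding variance inflation factor (VIF). The Python sketch below is an illustration and not the SPSS diagnostic used in the study; the function names and data values are hypothetical, and the two-predictor VIF formula 1 / (1 - r²) applies only when exactly two predictors are compared.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(r):
    """Variance inflation factor when only two predictors are involved."""
    return 1.0 / (1.0 - r ** 2)


if __name__ == "__main__":
    word_counts = [20000, 35000, 50000, 65000, 80000]   # hypothetical values
    reference_counts = [150, 90, 120, 60, 140]          # hypothetical values
    r = pearson_r(word_counts, reference_counts)
    print(round(r, 4), round(vif_two_predictors(r), 4))
```

A VIF close to 1 indicates little shared variance between the two predictors; larger values signal that the predictors carry overlapping information.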

Sample size guidelines for multinomial logistic regression indicate a minimum of

10 cases per independent variable (Schwab, 2002). That guideline was exceeded with

the 360-document sample. Field (2011) and Anderson (1984) warn that empty cells or

zero cases in a subpopulation are an issue with logistic regression. I found that several

research methods did not fully populate the subpopulation cells. I combined them into

the "Other" class within the research method variable. That eliminated the empty cells

without reducing the sample size.

I conducted the multinomial logistic regression analyses for each level in the

dependent variable categories (low, high, and excessive) as the specified reference

base. Moreover, I conducted the MLR analyses for all possible subsets of the

independent variables. The resulting Nagelkerke's pseudo R² values explained the

importance of the predictors in terms that closely behave like a linear model (Allison,

2013). Bewick, Cheek, and Ball (2005) explained that Nagelkerke's pseudo R²

demonstrates “how useful the explanatory variables are in predicting the response

variable and can be referred to as measures of effect size” (p. 116). These subset

analyses provided the data needed to determine the best combination of the

independent variables in the final MLR model.

RESULTS

This chapter contains the data analysis results from the dissertation corpus,

followed by the results from the published article corpus. I included the research

questions followed by the respective results. Various tables and brief explanations

present the results for reader review.

The following is a quick review of the pertinent acronyms, starting with the

document similarity index (DSI). The DSI is a summation of the applied source similarity

indices (SSI) as reported by Turnitin before any researcher exclusions or adjustments.

The adjusted document similarity index (aDSI) is the DSI less any researcher

exclusions or adjustments. The substantive SSIs are those SSIs equal to or larger than

5%. The document similarity level (DSL) is the dependent variable derived from aDSI

range values. I derived the document similarity words (DSW) and adjusted document

similarity words (aDSW) from the individual document DSI and aDSI (both percentages)

as applied to the document word count variable.
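The relationship between the percentage indices and the word-based variables can be sketched in Python as follows. The function names, the rounding to whole words, and the example document are assumptions made for illustration; the text states only that the word variables are the percentages applied to the word count and that substantive SSIs are 5% or larger.

```python
def similarity_words(index, word_count):
    """Whole-word similarity count from a similarity index (a proportion) and a word count."""
    if not 0.0 <= index <= 1.0:
        raise ValueError("index must be a proportion between 0 and 1")
    # Rounding to whole words is an assumption; the exact rule is not stated.
    return round(index * word_count)

def is_substantive(ssi):
    """A source similarity index counts as substantive at 5% or larger."""
    return ssi >= 0.05

# Hypothetical document: 50,000 words, DSI of 27%, aDSI of 9%.
dsw = similarity_words(0.27, 50_000)
adsw = similarity_words(0.09, 50_000)
print(dsw, adsw, is_substantive(0.04), is_substantive(0.05))
```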

Dissertation Descriptive Statistics Results

I conducted a descriptive statistical analysis on the corpus of 360 HRD-related dissertations (see Tables 4, 5, and 6). Table 4 exhibits the mean statistics for relevant variables across the corpus of dissertations. Most important are the dissertation aDSI

values of M = .09 (SD = .06). Only nine of the 360 dissertations had any non-excluded

substantive SSIs. Thus, the SSI means were reported at extremely low levels. Most

noticeable are the large kurtosis values for many of the variables.

Removal of outliers is a common way to add normality to the sample, but as

Grace-Martin (2016) stated, “It is NOT acceptable to drop an observation just because it

is an outlier. They can be legitimate observations and are sometimes the most

interesting ones” (n.p.).

Noticeably absent from Tables 4 and 5 is any reference to substantive plagiarism-of-self SSI data in the dissertations. I did not find any substantive plagiarism-of-self SSI data in the review of the SSI. However, I did find one 4% plagiarism-of-self SSI value. I mention this because it is important to demonstrate that dissertations are not immune to plagiarism-of-self issues.

Table 4

Corpus Wide Descriptive Statistics for Dissertations (n=360)

Variable Min Max M SD Skewness S.E. Kurtosis S.E.

Year of Publication 2011 2015 2013 1.36 .20 .13 -1.19 .26

Author Counta 1 1 1 .00 .00 .00 .00

Word Count 13160 155389 48877 22833 1.34 .13 2.17 .26

Reference Count 1 602 140 76 1.69 .13 5.19 .26

DSI .01 .98 .27 .28 1.45 .13 .60 .26

DSWb 822 141404 13434 19018 3.05 .13 11.18 .26

aDSI .00 .40 .09 .06 1.29 .13 3.13 .26

aDSWb 0 19990 3828 2484 1.54 .13 3.76 .26

SSI Other Countc 0 3 .04 .31 8.11 .13 69.59 .26

SSI Other Similarity Mc .00 .16 .03 .02 7.43 .13 56.17 .26

SSI Other Wordbc 0 4980 73 496 7.43 .13 58.05 .26

aAuthor Count for dissertations was always 1.

bDSW, aDSW and SSI word counts are whole word similarities.

cSubstantive SSI or SSI equal to 5% or larger.

A basic understanding of the relationships between the variable data collected in

a study is important and a precursor to any predictive statistics (Field, 2011). Normally,

a Pearson's r correlation coefficient is the statistic used to identify the relationship

between two variables. However, a Pearson's r correlation coefficient statistic has an

assumption of data normality (Field, 2011). I tested all of the variables using a Shapiro-

Wilk test for normality. All of the variables violated the assumption of normality (p < .05).

There are two camps on the importance of the assumption of normality for a Pearson’s r

statistic. Nefzger and Drasgow (1957) posited, "Tests of significance of r do not in

practice require normally distributed variates" (p. 623). However, Field (2011) stated

that the Spearman's rho Correlational Coefficient Statistic is a non-parametric statistic

that is used for better results when the data violate the assumption of normality. Table 5

identifies the relationships between the variables using the Spearman's rho correlational

coefficient statistic. The results show a substantial positive correlation between word counts and reference counts (.49) and strong correlations between the similarity percentages and their corresponding similarity word counts (DSI with DSW, .87; aDSI with aDSW, .81), including perfect correlations (1.00) among the count, percentage, and word count variables for the substantive SSI of others.
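As an illustration of the statistic itself, Spearman's rho can be computed as a Pearson correlation on ranked data, with tied values sharing their average rank. The sketch below is a plain Python illustration under that standard definition; the function names and the sample values are hypothetical and are not drawn from the corpus.

```python
import math

def average_ranks(values):
    """Rank values from smallest (1) to largest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: y is a monotone (but non-linear) function of x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y))
```

Because only rank order matters, any strictly increasing relationship yields a rho of 1 even when the relationship is not linear.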

Table 5

Corpus Wide Spearman's rho Correlational Coefficient Statistic for Dissertations (n=360)

Variable 1 2 3 4 5 6 7 8 9 10 11

1 Year of Publication 1.00

2 Author Countb - 1.00

3 Word Count <.01 - 1.00

4 Reference Count <.01 - .49a 1.00

5 DSI -.06 - -.13a .13a 1.00

6 DSW -.07 - .32a .34a .87a 1.00

7 aDSI <.01 - -.28a .04 .14a .04 1.00

8 aDSW <.01 - .27a .30a .04 .22a .81a 1.00

9 SSI Other Count -.03 - -.17a -.07 .18a .13a .25a .16a 1.00

10 SSI Other Similarity M -.03 - -.17a -.07 .18a .13a .25a .16a 1.00a 1.00

11 SSI Other Word Count -.04 - -.17a -.07 .18a .13a .25a .16a 1.00a 1.00a 1.00

aCorrelation is significant at the p < 0.05 level (2-tailed)

Continuing forward to Table 6, I provided additional descriptive statistics for the

various subpopulations in several of the continuous and categorical variables. Table 6

provides variable and subsample frequencies including aDSI and aDSW statistics by

year, research method, author count (one author for dissertations), and DSLs.

For additional insight, I calculated the aDSI and the aDSW within each

subpopulation. While this information did not answer any particular research question, it

provided the ability to observe the diversity of group memberships and congruence in

calculated values. Moreover, the data visually demonstrated potential prediction issues.

Table 6

Descriptive Statistics for Dissertations Subpopulation (n = 360)

Variable Sub-Populations

Freq. Percent aDSI (range) SD aDSW SD

Year of Publication 360 100.0 .09 .09 3828 2484

2011 90 25.0 .09 .07 4171 3638

2012 78 21.7 .08 .05 3904 2659

2013 76 21.1 .07 .05 3446 2538

2014 71 19.7 .09 .06 3817 2729

2015 45 12.5 .09 .05 3816 2272

Research Method 360 100.0 .09 .09 3828 2484

Quantitative 138 38.3 .11 .06 4058 2329

Qualitative 160 44.4 .07 .05 3697 3239

Other 62 17.2 .08 .06 3685 3011

Author Counta 360 100.0 .09 .09 3828 2484

1 360 100.0 .09 .09 3828 2484

Document Similarity Levels 360 100.0 .09 .09 3828 2484

Low 317 88.1 .07 (.00-.14) .04 3277 2339

High 35 9.7 .18 (.15-.24) .03 7350 2287

Excessive 8 2.2 .30 (.26-.40) .05 10486 5119

aAuthor Count for dissertations was always 1.

Regarding dissertations, I noticed three issues: a) the quantitative research method

documents returned the highest aDSI, b) there were no plagiarism-of-self substantive

SSI, and c) DSL membership was highly skewed to the low level. The research method

identified as 'Other' was a catchall that included the Ph.D. candidates' descriptions of

their dissertations as a mixed method, literature review, an action project, an

organization review, and an untested survey design. See Appendix A for a further

dissertation frequency breakdown of the research methods categorical variable within

each DSL, word count groupings, and reference count groupings.

Dissertation Differential Statistics Results

An important part of this study involved the Turnitin DSI and the importance of

making adjustments to or exclusions from the DSI. This lengthy process included many

manual validation procedures (see Methodology section). RQ2.1 and RQ2.2 provided

the justification for requiring these processes. It is important to know whether the

differences between the DSI and the aDSI are statistically and practically significant for

both the dissertation and the published article corpora.

Researchers often use the paired t-test statistic to test for differences between

pairs of data. In this study, the intervention was the removal of SSI false positives.

However, a paired t-test assumes data normality of the difference between the DSI and

aDSI variables.

Figure 10 presents histograms that show the visual differences between the dissertation values of the Turnitin-reported DSI and my reported aDSI. In both histograms, the values were concentrated at the low (left) end with long tails to the right, that is, positively skewed. The left DSI histogram exhibited minor bimodal plotting. These histograms indicated non-normal distributions.

Moreover, the difference between the DSI and the aDSI had a skewness statistic of 1.59 (SE = .129), Z = 12.32, and a kurtosis statistic of .852 (SE = .256), Z = 3.33. Figure 11, in combination with these statistics, demonstrates that the difference potentially violates the t-test assumption of normality (Cramer, 1998; Cramer & Howitt, 2004).
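The Z values above are consistent with dividing each statistic by its standard error. The following Python sketch shows that arithmetic using the rounded values printed in the text, so the last digit may differ slightly from the reported Z. The function name and the 1.96 cutoff (a common convention for a two-tailed test at the .05 level) are assumptions, not values stated in the text.

```python
def z_score(statistic, standard_error):
    """Standardize a skewness or kurtosis statistic by its standard error."""
    return statistic / standard_error

skew_z = z_score(1.59, 0.129)
kurt_z = z_score(0.852, 0.256)

CUTOFF = 1.96  # assumed conventional cutoff, not taken from the text
for name, z in (("skewness", skew_z), ("kurtosis", kurt_z)):
    print(f"{name}: Z = {z:.2f}, exceeds cutoff: {abs(z) > CUTOFF}")
```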

Figure 10. Dissertation DSI & aDSI frequencies before and after the adjustments.

The black bars represent the Document Similarity Levels separators for Low, High, and Excessive.

To confirm the violation of the assumption of normality, I conducted a Shapiro-Wilk Test for Normality on the difference (Shapiro & Wilk, 1965; Field, 2011). I chose the Shapiro-Wilk Test for Normality over the Kolmogorov-Smirnov Test for Normality because the Shapiro-Wilk Test for Normality is applicable for matched pairs. The name of this test is somewhat misleading in that a statistically significant result indicates a departure from normality. The test returned a statistically significant statistic of .629 (df = 360, p < .05). Thus, the difference between the paired values departed significantly from a normal distribution and did not meet the assumption of normality that is preferred for a t-test.

Figure 11. Histogram exhibiting difference between dissertation paired DSI & aDSI.

Based upon the non-normality of the data, I conducted a Wilcoxon Signed-rank test to identify the statistical significance of the difference (Zimmerman, 1996; Field, 2011). As previously discussed in the Methodology section, the Wilcoxon Signed-rank non-parametric statistic ranks the absolute paired differences and then compares the ranks of the positive and negative differences, much as a t-test compares means. According to Zimmerman (1996):

The Wilcoxon test is more powerful for various non-normal distributions with

excess skewness and kurtosis, including mixed-normal, exponential, Cauchy,

and Laplace distributions (Hodges & Lehmann, 1956; Randles & Wolfe, 1979).

The comparison does not imply that violation of the normality assumption has no

influence on the Wilcoxon test. On the contrary, the power of the test to detect

alternatives declines, despite maintenance of the significance level. But the

power of the 't-test' declines even more, so the nonparametric method acquires

an advantage. (p. 29)

The Wilcoxon Signed-rank test confirmed the significance of the difference between the DSI and the aDSI, Z = -14.214 (p < .05). The Wilcoxon Signed-rank effect size estimate r value of .52 (p < .05) provided strong support that these adjustments were statistically and practically significant (Field, 2011; Rosenthal, 1991, p. 19). Rosnow and Rosenthal (2005) stated that a Wilcoxon Signed-rank effect size estimate r value of 0.50 and above is large.
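One common convention estimates this effect size as the absolute Z divided by the square root of the number of observations. The Python sketch below applies that convention under the assumption that N counts both members of each pair (2 x 360 = 720). The text does not state which N was used, so the result should be read as an approximation that lands near the reported value rather than a reproduction of it.

```python
import math

def wilcoxon_effect_size(z, n_observations):
    """Effect size r = |Z| / sqrt(N), with N the total number of observations (assumed)."""
    return abs(z) / math.sqrt(n_observations)

# Z as reported for the dissertation corpus; N = 720 is an assumption.
r = wilcoxon_effect_size(-14.214, 2 * 360)
print(f"r = {r:.3f}, large by the .50 benchmark: {r >= 0.50}")
```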

Dissertation Predictive Statistics Results

reference count predict membership in low, high, or excessive levels of the plagiarism

I conducted a multinomial logistic regression (MLR) analysis to predict

dissertation membership outcomes in the DSL. The goal of this MLR analysis was to

determine which independent variables, or combination of variables, influenced the probability of

a document belonging to one of the three DSLs. As previously discussed, the DSL is a

multinomial dependent variable based upon three categories using the aDSI with the

following membership criteria:

1. Low: Less than or equal to 14%

2. High: Between 15% and 24%

3. Excessive: Greater than or equal to 25%
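The membership criteria above can be expressed as a small Python function. This is a sketch that assumes the aDSI is supplied as a whole-number percentage, which is how the criteria are stated; the function name is illustrative, and the handling of fractional values between the listed boundaries is an assumption.

```python
def document_similarity_level(adsi_percent):
    """Map an aDSI percentage to its document similarity level (DSL)."""
    if adsi_percent < 0 or adsi_percent > 100:
        raise ValueError("aDSI must be a percentage between 0 and 100")
    if adsi_percent <= 14:
        return "Low"
    if adsi_percent <= 24:
        # Fractional values such as 14.5 fall here by assumption.
        return "High"
    return "Excessive"

for value in (0, 14, 15, 24, 25, 40):
    print(value, document_similarity_level(value))
```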

Overall, within the corpus of 360 dissertations, SRT identified 317 cases or

88.1% in the low plagiarism range, 35 cases or 9.7% in the high plagiarism range, and 8

cases or 2.2% in the excessive plagiarism range (see Figure 12). However, this

skewed DSL membership also caused subpopulation membership issues. The

document research method “Mixed” had only two cases in the high DSL and one case

in the excessive DSL. The document research method “Other” had one case in the high

DSL and zero cases in the excessive DSL (see Appendix A). Field (2011) and Anderson

(1984) warned that empty or minimally populated cells are an issue with logistic

regression. After careful consideration, I moved the research method "mixed" cases into

the research method "other" cases to eliminate the empty cells, thereby improving the

model fit.

An important part of the MLR statistic process is configuring the best model. I

used the Nagelkerke's Pseudo R2 to design the strongest model. The Nagelkerke's

Pseudo R2 is a model based pseudo effect size statistic used in MLR to explain the

importance of the predictors in terms that closely behave like a linear model (Allison,

2013). Bewick, Cheek, and Ball (2005) explained that Nagelkerke's Pseudo R2

demonstrates, “how useful the explanatory variables are in predicting the response

variable and can be referred to as measures of effect size” (p. 116). To engineer the

best model, I used a Word VBA script that listed all possible subsets (see Figure 13). I then conducted an MLR on each of the possible variable combinations.

Figure 12. Dissertation membership in DSL groups.

While the Nagelkerke's Pseudo R2 is valuable in MLR for determining the best

MLR model, it should be noted the Nagelkerke's Pseudo R2 "cannot be interpreted

independently or compared across different datasets" (IDRE-UCLA, 2017B, p.7). With

the available predictors arranged in all of the possible combinations, the highest

Nagelkerke's Pseudo R2 of .169 was achieved using a model of Year of Pub, Research

Method, Word Count and Reference Count (see Table 7).

Figure 13. MS Word VBA script for building all possible string subsets. This script creates a new MS Word document with the full listing of the subsets.
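For readers without access to the VBA script, the same all-possible-subsets listing can be sketched in Python. This is an illustrative equivalent, not the script shown in Figure 13; the function name is an assumption, and the predictor labels follow those used in Table 7.

```python
from itertools import combinations

def all_subsets(predictors):
    """List every non-empty subset of the predictors, largest subsets first."""
    return [
        combo
        for size in range(len(predictors), 0, -1)
        for combo in combinations(predictors, size)
    ]

predictors = ["Research Method", "Year of Pub", "Word Count", "Reference Count"]
subsets = all_subsets(predictors)

print(len(subsets))  # 2**4 - 1 = 15 candidate models
for combo in subsets:
    print(", ".join(combo))
```

With k predictors there are 2^k - 1 non-empty subsets, so four predictors yield 15 candidate models and five predictors yield 31.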

Table 7

Modeling MLR Analysis of Sampled Dissertations

Columns: Model (predictors); Classification Percentages (Low, High, Excessive, Total); Effect Size, Pseudo R2 (Cox & Snell, Nagelkerke, McFadden)

Research Method, Year of Pub, Word Count, Reference Count 99.7 0.3 0.0 87.8 .097 .169 .120

Research Method, Word Count, Reference Count 99.7 0.3 0.0 87.8 .091 .159 .112

Year of Pub, Word Count, Reference Count 99.7 0.3 0.0 88.1 .082 .143 .101

Word Count, Reference Count 100 0.0 0.0 88.1 .078 .136 .095

Research Method, Year of Pub, Reference Count 100 0.0 0.0 88.1 .064 .111 .078

Research Method, Year of Pub, Word Count 100 0.0 0.0 88.1 .056 .097 .068

Research Method, Reference Count 100 0.0 0.0 88.1 .056 .098 .068

Research Method, Word Count 100 0.0 0.0 88.1 .048 .084 .058

Research Method, Year of Pub 100 0.0 0.0 88.1 .050 .088 .061

Research Method 100 0.0 0.0 88.1 .042 .073 .051

Year of Pub, Word Count 100 0.0 0.0 88.1 .027 .048 .033

Word Count 100 0.0 0.0 88.1 .022 .039 .027

Year of Pub, Reference Count 100 0.0 0.0 88.1 .015 .025 .017

Reference Count 100 0.0 0.0 88.1 .010 .018 .012

Year of Pub 100 0.0 0.0 88.1 .005 .008 .006

Interpreting an MLR analysis requires an understanding of odds ratios. Odds ratios are not the same as a predictive value (Grace-Martin, 2017). For a continuous variable, the odds ratio reflects the multiplicative change in the odds per one-unit increase in that variable. In this study, one unit of change for words is 100 words and one unit of change for references is 20 references. For example, a one-unit increase in the reference variable (20 references) multiplies the odds that a document belongs to the high DSL rather than the low DSL by 1.599, as long as all other variables remain constant. The low DSL is the reference category.
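The link between a regression coefficient and its odds ratio can be sketched in Python. The example takes B = .470 and a standard error of .152 for the reference count variable at the high DSL, as read from Table 8. Because those inputs are rounded, the results may differ from the tabled values in the third decimal place. The function names and the 1.96 critical value for a 95% interval are assumptions.

```python
import math

def odds_ratio(b, se, z_crit=1.96):
    """Return (odds ratio, lower CI bound, upper CI bound) for a logit coefficient."""
    return (
        math.exp(b),
        math.exp(b - z_crit * se),
        math.exp(b + z_crit * se),
    )

def percent_change_in_odds(ratio):
    """Percentage change in the odds per one-unit increase in the predictor."""
    return (ratio - 1.0) * 100.0

ratio, lower, upper = odds_ratio(0.470, 0.152)
print(f"odds ratio = {ratio:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
print(f"change in odds per 20 references = {percent_change_in_odds(ratio):.1f}%")
```

An interval that excludes 1 corresponds to a statistically significant predictor at the chosen level, which is why the lower bound matters as much as the ratio itself.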

The odds ratios for a categorical variable such as the research method variable also assume all other variables remain constant. If one ignores the statistically insignificant value (p = .762) for the quantitative research method in the excessive DSL, the odds ratio of 1.415 to 1 can be interpreted as follows: a document that used a quantitative research method had 1.415 times the odds of belonging to the excessive DSL rather than the low DSL as compared to a document using the other research method (other is the categorical comparison base). Likewise, if one ignores the statistically insignificant value (p = .448) for the qualitative research method, the odds ratio of .332 to 1 indicates that a document using the qualitative research method had 68% (1 - .332) lower odds of belonging to the excessive DSL rather than the low DSL as compared to the other research method category.

In summary, Table 8 exhibits statistically significant odds ratios for word counts

and reference counts at both the high and excessive levels. The model's overall

practical effect size using the Nagelkerke Pseudo R2 was .169. The practical effect size

is small.

Figure 14 provides visual representations in the form of scatter plots for both the

continuous independent variables word count and reference counts. What is noticeable

is the heteroscedasticity in the word count scatterplot and the opposing directions of the

regression lines between the word and reference count independent variables.

Table 8

MLR Analysis of Sampled Dissertations (Best Model) n=360

Dependent Variable Levels Independent Variables

B Std Error Wald df Sig Exp(B)a Odds Ratio

95% CI

High Document Similarity Level Publication Year Word Count (x100) Reference Count (x20) Qualitative Method Quantitative Method Other Methodb

445.994 -.223 -.380 .470 .970 .162 .0

277.477 .138 .145 .152 .663 .689

2.583 2.611 6.851 9.594 2.141 .055

1 1 1 1 1 1 0

----- .800 .684 1.599 2.639 1.175

----- .611 .515

1 .188 .719 .305

----- 1.049 .909

2.153 9.680 4.535

Excessive Document Similarity Level Publication Year Word Count (x100) Reference Count (x20) Qualitative Method Quantitative Method Other Methodb

388.141 -.194 -.881 .877 .347

-1.012 .0

553.192 .275 .380 .310

1.145 1.451

.499 5.362 8.014 .092 .576

1 1 1 1 1 1 0

------ .824 .414 2.403 1.415 .322

----- .481 .197

1.310 .150 .019

------ 1.411 .873 4.409

13.353 .000 .649

aConfidence Intervals for Exp(B) at 95%.

bThis parameter is set to zero because it is the Research Method parameter reference category.

Note: Comparison Category is Low Document Similarity Level. Bold signifies statistically significant.

Published Article Descriptive Statistics Results

I also conducted a descriptive statistical analysis on the corpus of 360 HRD-related published articles (see Tables 9, 10, and 11). Table 9 exhibits the mean statistics for relevant variables across the corpus of published articles. For instance, published articles had a wide range of document characteristics, from a one-page editorial at 699 words with one reference to the largest, a 53-page article at 24,463 words. Table 10 identifies potential relationships between the variables. Table 11 provides frequencies and group aDSI and aDSW statistics by year, research method, author counts, and DSLs.

Figure 14. Dissertation scatter plots for word count and reference count variables. The green and orange lines divide the three DSL levels and the red line is the regression line.

Regarding Table 9, one must note that the descriptive statistics were based upon

the sample of 360 published articles. That was important for the substantive SSI. There

were only 81 out of 360 published articles with any substantive SSI. Thus, the

substantive SSI means were diluted down to extremely low levels. Table 10 provided

additional statistics on substantive SSI. Most important are the aDSI M = .11 (SD = .10)

and the substantive SSI other similarity of M = .01 (SD = .04) and substantive SSI self-

similarity M = .04 (SD = .12). As with the dissertations, the kurtosis statistics for the

published articles were very high across most of the variables. As previously discussed,

the removal of outliers is a common way to add normality to the sample, but as Grace-

Martin (2016) stated: “It is NOT acceptable to drop an observation just because it is an

outlier. They can be legitimate observations and are sometimes the most interesting

ones” (n.p.).

Table 9

Corpus Wide Descriptive Statistics for Published Articles (n=360)

Variable Min Max M SD Skewness S.E. Kurtosis S.E.

Year of Publication 2011 2015 2013 1.27 .24 .13 -1.21 .26

Author Count 1 18 2.49 1.86 3.19 .13 18.18 .26

Word Count 699 24463 8552 4186 .95 .13 1.37 .26

Reference Count 1 257 46 34 2.05 .13 7.88 .26

DSI .02 1.00 .67 .35 -.83 .13 -1.07 .26

DSWa 82 23484 5877 4516 .95 .13 1.04 .26

aDSI .00 .84 .11 .10 2.67 .13 10.40 .26

aDSWa 0 7823 900 956 2.83 .13 11.97 .26

SSI Other Countb 0 4 .12 .52 5.46 .13 32.58 .26

SSI Other Similarity Mb .00 .30 .01 .04 5.31 .13 29.41 .26

SSI Other Word Countab 0 4889 65 355 9.42 .13 110.17 .26

SSI Self Countb 0 6 .35 .91 3.36 .13 12.38 .26

SSI Self Similarity Mb .00 .96 .04 .12 4.22 .13 22.27 .26

SSI Self Word Countab 0 6692 328 977 3.93 .13 16.98 .26

aDSW, aDSW and SSI word counts are whole word similarities.

bSubstantive SSI or SSI equal to 5% or larger.

A basic understanding of the relationships between the variable data collected in

a study is important (Field, 2011). Normally, a Pearson's r correlation coefficient is the

statistic used to identify the relationship between two variables. However, Pearson's r

correlation coefficient has an assumption of data normality. I tested all of the variables

using a Shapiro-Wilk test for normality. All of the variables violated the assumption of

normality (p < .05). There are two camps on the importance of the assumption of

normality for a Pearson’s r statistic. Nefzger and Drasgow (1957) posited, "Tests of

significance of r do not in practice require normally distributed variates" (p. 623).

However, Field (2011) stated that the Spearman's rho correlational coefficient statistic is

a non-parametric statistic that researchers use when the data violate the assumption of

normality. Table 10 identifies the relationships between the variables using the

Spearman's rho correlational coefficient statistic.

Moving on to Table 11, the analysis provided additional descriptive statistics for

various membership levels in several of the continuous and categorical variables. I

provided frequency counts as well as the mean statistics for the aDSI and the aDSW

within each group and subpopulation for additional insight. See Appendix B for a further

frequency breakdown of the research methods categorical variable within each DSL,

word count groupings, and reference count groupings.

Table 10

Corpus Wide Spearman's rho Correlational Coefficient Statistic for Published Articles (n=360)

Variable 1 2 3 4 5 6 7 8 9 10 11 12 13 14

1 Year of Publication 1.00

2 Author Count .02 1.00

3 Word Count -.02 .12 a 1.00

4 Reference Count .01 .17 a .76a 1.00

5 DSI -.17a .08 .04 .10 1.00

6 DSW -.13a .12a .62a .50a .68a 1.00

7 aDSI -.08 .08 -.04 .12a .08 .03 1.00

8 aDSW -.08 .15a .44a .47a .10 .33a .84a 1.00

9 SSI Other Count .08 -.03 -.09a -.03 <.01 -.05 .35a .25a 1.00

10 SSI Other Similarity M .07 -.03 -.09 -.02 <.01 -.05 .35a .25a 1.00a 1.00

11 SSI Other Word Count -.07 -.03 -.08 -.02 <.01 -.05 .35a .25a 1.00a 1.00a 1.00

12 SSI Self Count -.09 -.01 -.04 .08 .03 .04 .56a .52a .16a .16a .16a 1.00

13 SSI Self Similarity M -.09 -.01 -.04 .09 .03 .04 .57a .52a .17a .17a .17a 1.00a 1.00

14 SSI Self Word Count -.09 -.01 -.06 .11a .03 .05 .57a .52a .16a .16a .16a 1.00a 1.00a 1.00

aCorrelation is significant at the p < 0.05 level (2-tailed)

Table 11

Descriptive Statistics for Published Articles Subpopulations (n=360)

Variable Level

Freq. Percent aDSI (range) SD aDSW SD

Year of Publication 360 100.0 .11 .10 900 956

2011 111 30.8 .11 .11 900 944

2012 78 21.7 .13 .11 1139 1247

2013 76 21.1 .10 .10 804 772

2014 81 22.5 .11 .11 814 849

2015 14 3.81 .07 .04 593 294

Research Method 360 100.0 .12 .10 1022 958

Quantitative 122 33.9 .13 .12 1233 1220

Qualitative 83 23.1 .08 .09 727 741

Other 155 43.0 .10 .09 732 732

Author Count 360 100.0 .11 .10 900 956

1 122 33.9 .12 .10 903 917

2 95 26.4 .12 .11 1028 956

3 76 21.1 .13 .10 1177 1147

4 33 9.2 .14 .11 1138 871

5 through 18 34 9.4 .12 .07 975 690

Document Similarity Level 360 100.0 .12 .10 1022 958

Low 285 79.2 .07 (.00-.14) .04 576 423

High 46 12.8 .19 (.15-.24) .03 1588 721

Excessive 29 8.1 .38 (.25-.84) .13 2995 164

Published Article Differential Statistics Results

Turnitin’s reported document similarity indices (DSI) and the adjusted document

Researchers often use the paired t-test statistic to test for differences between pairs of data. However, a paired t-test assumes normality of the difference between the two variables. Figure 15 presents two histograms that display the visual difference between the published article DSI and aDSI values. The left DSI histogram exhibited a slight bimodal distribution, while the right aDSI histogram was unimodal with extreme skewness. The histograms provided a visual indication of non-normal distributions.

However, the t-test assumption of normality is based upon the difference between the two dependent variables. The difference between the DSI and aDSI produced a skewness statistic of -.640 (SE = .129), Z = -4.96, and a kurtosis statistic of -1.355 (SE = .256), Z = -5.29 (Cramer, 1998; Cramer & Howitt, 2004). The histogram (see Figure 16) illustrates a bimodal distribution with this skewness. I conducted a Shapiro-Wilk Test for Normality on the difference (Shapiro & Wilk, 1965; Field, 2011). The results returned a statistic of .792 (df = 360, p < .05). The data tested violated the assumption of normality, so a paired t-test would not be suitable.

Figure 15. Published article DSI & aDSI frequencies before and after adjustments. The black bars represent the Document Similarity Levels separators for Low, High, and Excessive.

Because of the confirmed non-normality of the data, I conducted a Wilcoxon Signed-rank test (Zimmerman, 1996; Field, 2011). As previously discussed in the Methodology section, the Wilcoxon Signed-rank statistic ranks the paired differences and then compares the ranks of the positive and negative differences. The Wilcoxon Signed-rank test confirmed the significance of the difference between the DSI and aDSI, Z = -15.626 (p < .05). The Wilcoxon Signed-rank effect size estimate r value of .58 (p < .05) provided strong support that these adjustments were statistically and practically significant (Field, 2011; Rosenthal, 1991). Rosnow and Rosenthal (2005) stated that a Wilcoxon Signed-rank effect size estimate r value of 0.50 and above is large.

Figure 16. Histogram exhibiting difference between published article paired DSI & aDSI.

Published Article Predictive Statistics Results

I also conducted a multinomial logistic regression (MLR) analysis to predict

published article membership outcomes in the DSL. The goal of an MLR analysis is to

determine what independent variables influence the probability of a document belonging

to one of the three DSLs (see Figure 17).

Figure 17. Published article membership in DSL groups.

Overall, within the corpus of 360 published articles, SRT identified 285 cases or

79.2% in the low plagiarism range, 46 cases or 12.8% in the high plagiarism range, and

29 cases or 8.1% in the excessive plagiarism range. However, the document research

method “mixed” had only 4 cases in the high DSL and only one in the excessive DSL.

The research method “other” classification had 15 members in the high DSL and 12

members in the excessive DSL. Lee, Ahn, Moon, Kodell and Chen (2013) stated,

"Although logistic regression is known to be robust as a classification method and is

widely used, it requires that there be more observations than predictors" (p. 682).

With 360 observations, I have met that criterion. However, with such a high concentration (separation issues) in the low DSL, I again suspected that prediction would be difficult (Anderson, 1984). As previously noted, to improve the model I moved the “mixed” research method cases into the “other” research method cases.

The Nagelkerke's Pseudo R2 is a model based pseudo effect size statistic used in

MLR to explain the importance of the predictors in terms that closely behave like a

linear model (Allison, 2013). Bewick, Cheek, and Ball (2005) explained using

Nagelkerke’s Pseudo R2 to demonstrate “how useful the explanatory variables are in

predicting the response variable and can be referred to as measures of effect size” (p.

116). With the available predictors arranged in all of the possible combinations, the

highest Nagelkerke's Pseudo R2 of .095 was realized using Research Method, Year of

Pub, Author Count, Word Count, and Reference Count as predictors (see Table 12).

Table 12

Modeling MLR Analysis of Sampled Published Articles

Columns: Model (predictors); Classification Percentages (Low, High, Excessive, Total); Effect Size, Pseudo R2 (Cox & Snell, Nagelkerke, McFadden)

Research Method, Year of Pub, Author Count, Word Count, Reference Count 100 0 0 79.2 .069 .095 .055

Research Method, Author Count, Word Count, Reference Count 100 0 0 79.2 .061 .084 .049

Research Method, Year of Pub, Word Count, Reference Count 100 0 0 79.2 .061 .084 .048

Research Method, Word Count, Reference Count 100 0 0 79.2 .053 .072 .041

Year of Pub, Author Count, Word Count, Reference Count 100 0 0 79.2 .049 .067 .038

Year of Pub, Word Count, Reference Count 100 0 0 79.2 .044 .061 .035

Author Count, Word Count, Reference Count 100 0 0 79.2 .041 .056 .032

Research Method, Year of Pub, Author Count, Reference Count 100 0 0 79.2 .038 .052 .030

Word Count, Reference Count 100 0 0 79.2 .036 .050 .028

Research Method, Year of Pub, Author Count, Word Count 100 0 0 79.2 .037 .051 .029

Research Method, Year of Pub, Author Count 100 0 0 79.2 .033 .046 .026

Research Method, Year of Pub, Word Count 100 0 0 79.2 .032 .044 .035

Research Method, Year of Pub, Reference Count 100 0 0 79.2 .031 .043 .024

Research Method, Author Count, Reference Count 100 0 0 79.2 .030 .041 .023

Research Method, Author Count, Word Count 100 0 0 79.2 .029 .040 .023

Research Method, Year of Pub 100 0 0 79.2 .028 .038 .022

Research Method, Author Count 100 0 0 79.2 .025 .035 .020

Research Method, Word Count 100 0 0 79.2 .024 .032 .018

Research Method, Reference Count 100 0 0 79.2 .023 .032 .018

Research Method 100 0 0 79.2 .019 .027 .015

Year of Pub, Author Count, Reference Count 100 0 0 79.2 .017 .023 .013

Year of Pub, Author Count, Word Count 100 0 0 79.2 .015 .021 .012

Year of Pub, Reference Count 100 0 0 79.2 .013 .018 .010

Year of Pub, Word Count 100 0 0 79.2 .012 .017 .010

Year of Pub, Author Count 100 0 0 79.2 .012 .016 .009

Year of Pub 100 0 0 79.2 .009 .012 .007

Author Count, Reference Count 100 0 0 79.2 .009 .012 .007

Author Count, Word Count 100 0 0 79.2 .007 .009 .005

Reference Count 100 0 0 79.2 .005 .006 .004

Author Count 100 0 0 79.2 .003 .005 .003

Word Count 100 0 0 79.2 .004 .005 .003

Table 13 exhibits statistically significant odds ratios for word count and reference counts in the excessive DSL. For a continuous variable, the odds ratio reflects the multiplicative change in the odds per one-unit increase in that variable. In this study, one unit of change for words is 100 words and one unit of change for references is 20 references. For example, a one-unit increase in the references variable (20 references) multiplies the odds that a document belongs to the excessive DSL rather than the low DSL by 3.666, as long as all other variables remain constant. Regarding the word count variable, a one-unit increase in words (100 words) has an odds ratio of .124 to 1, which means the odds of belonging to the excessive DSL rather than the low DSL are 88% (1 - .124) lower, as long as all other variables remain constant.

In summary, Table 13 exhibits statistically significant odds ratios for word counts

and reference counts at only the excessive level. The model's overall practical effect

size using the Nagelkerke Pseudo R2 was .095. The practical effect size is very small.

Table 13

MLR Analysis of Sampled Published Articles (Best Model) n=360

                                                                    Exp(B)ᵃ       95% CI
Independent Variable              B  Std Error    Wald  df  Sig  Odds Ratio   Lower   Upper

High Document Similarity Level
  Intercept                 482.227    272.945   3.121   1           ------  ------  ------
  Publication Year            -.240       .136   3.143   1             .786    .603   1.026
  Author Count                -.061       .100    .369   1             .941    .773   1.145
  Word Count (x100)           -.592       .434   1.857   1             .553    .236   1.296
  Reference Count (x20)        .498       .268   3.443   1            1.646    .972   2.785
  Qualitative Method           .273       .381    .516   1            1.314    .623   2.770
  Quantitative Method         -.015       .428    .001   1             .985    .425   2.281
  Other Methodᵇ                  .0                      0

Excessive Document Similarity Level
  Intercept                 123.470    339.086           1           ------  ------  ------
  Publication Year            -.062       .168    .136   1             .940    .675   1.307
  Author Count                -.155       .152   1.040   1             .856    .636   1.154
  Word Count (x100)          -2.084       .557  13.996   1   *         .124    .042    .371
  Reference Count (x20)       1.299       .316  16.878   1   *        3.666   1.973   6.814
  Qualitative Method           .680       .450   2.147   1            1.934    .800   4.673
  Quantitative Method        -1.019       .794   1.648   1             .361    .076   1.711
  Other Methodᵇ                  .0                      0

ᵃ Confidence intervals for Exp(B) at 95%.
ᵇ This parameter is set to zero because it is the Research Method reference category.

Note: Comparison category is Low Document Similarity Level. An asterisk (*) in the Sig column signifies statistical significance.

Figure 18 provides visual representations in the form of scatter plots for the two
continuous independent variables, word count and reference count. What is noticeable is
the heteroscedasticity in both the word and reference count scatter plots. Moreover, just
as with the dissertation scatter plots for these two variables, it is evident that the
regression lines for the word count and reference count independent variables slope in
opposite directions.

Figure 18. Published article scatter plots for word count and reference count variables. The green and orange lines divide the three DSL levels and the red line is the regression line.

DISCUSSION

This discussion section combined the two corpora and compared the analysis

results and the methodologies of this study with prior studies. These comparisons and

expanded explanations provided an overview of the memberships in each DSL for

dissertations and published articles. Moreover, I discussed the implications this study

has for stakeholders as well as the direction of future research. Additionally, I reviewed

various encountered issues and obstacles with the intent to help future researchers

avoid some of the mistakes I made.

Discuss and Synthesize Research Findings

The academic and publishing profession, with good reason, has yet to develop

agreed-upon criteria for evaluating plagiarism and content similarity values. First,

similarity values, by themselves, are not proof of plagiarism. Second, there is a political

overtone tied to adjudicating plagiarism, as evidenced by Moore (2008), a managing

editor for the Ventura County Star, when he stated, “We have zero tolerance for

plagiarism” (n.p.). Thus, publicly there cannot be any acceptable level of plagiarism.

However, with limited reviewing resources (Willis, 2016), the application of plagiarism

levels can effectively allocate reviewer resources in the areas where they will be the

most effective in detecting plagiarism. Finally, there are no commonly accepted best

practices for detecting false positives. An examination of the Turnitin documentation and

prior studies did not provide any finite instructions that would guide reviewers or lead

researchers towards research “replicability” (Fournier, 2016, n.p.). This study provided

me with the opportunity to document both a set of specific processing steps for an

effective COA analysis and a set of synthesized similarity levels. Similarity levels can

serve institutions and publishers in identifying potential plagiarism events and the

appropriate remedies for their adjudication.

Descriptive Findings

For the majority of corpus plagiarism studies, descriptive statistics have been

instrumental in bringing a broad understanding to the phenomenon of plagiarism (Honig
& Bedi, 2012; Sun, 2012; Youmans, 2011; Batane, 2010; Keck, 2006). This exploratory

study examined two different corpora of documents that provided two sets of descriptive

data and the opportunity for cross checking. The Turnitin COA reports and derived data

provided a rich set of descriptive statistics that provided the DSI, aDSI, DSW, and

aDSW variables across the corpus of dissertations and published articles. This study

called attention to the synthesized word count variables DSW and aDSW because they

added a perspective that went beyond what the percentages alone provided.

The most important descriptive statistics gathered from this study were the DSI

and the aDSI. Corpus plagiarism studies often provide plagiarism data regarding

document plagiarism (similarity) by percentages or plagiarized word counts. However,

prior research has not been clear about whether the reported plagiarism statistics were

based on what Turnitin initially provided or if and how the investigator had adjusted the

Turnitin DSI down to an aDSI. I reported the DSI, the aDSI, the DSW, and the aDSW.

Using the DSI and synthesized DSW as my starting point, I found the mean aDSI for
the dissertations (M = .09, SD = .06) was about 80% of the mean aDSI of the published
articles (M = .11, SD = .10). However, the mean aDSW for the dissertations (M = 3828, SD =
2484) was more than four times the mean aDSW of the published articles (M = 900, SD = 956).

These findings illustrate the importance of recording the number of content words in a

plagiarism study document.
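The contrast between the percentage view and the word-count view can be made explicit with the reported means. This is only a sketch of the arithmetic, using the means stated above. The dictionary layout and variable names are illustrative assumptions.

```python
# Reported corpus means: adjusted similarity index (aDSI) and adjusted similar words (aDSW).
dissertations = {"aDSI": 0.09, "aDSW": 3828}
articles = {"aDSI": 0.11, "aDSW": 900}

# Ratio of dissertation means to published article means.
index_ratio = dissertations["aDSI"] / articles["aDSI"]
word_ratio = dissertations["aDSW"] / articles["aDSW"]

print(f"aDSI ratio (dissertations / articles): {index_ratio:.2f}")
print(f"aDSW ratio (dissertations / articles): {word_ratio:.2f}")
```

The index ratio falls below 1 while the word ratio is a little above 4, which is why the percentage alone can understate how many similar words a long document contains.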

In support, Ison (2012) suggested that the number of potential words plagiarized

is as important as the percent of the document plagiarized. Ison reported word

similarities based upon document words and the percentage of the document

similarities. Honig and Bedi (2012) also used plagiarized words as their primary statistic

over the percentage plagiarized for identifying important findings in their study.

While these findings imply that plagiarized word counts across various corpora
are important in COA research, researchers should also examine the wide ranges in
document word counts within a corpus. For instance, my study recorded the smallest
article at 699 words (one page editorial with one reference) and the largest article at
24,463 words. The ratio between those two articles (about 1 to 35) is larger than the
ratio (about 1 to 6) between the mean word counts of the published articles
(M = 8552, SD = 4186) and the dissertations (M = 48603, SD = 22833).

My study also collected 1193 instances of non-excluded SSIs equal to or
larger than 1% from the dissertation corpus and 2254 instances of the same from the
published article corpus. From that collection, I identified 14 substantive
plagiarism-of-others SSIs in the dissertations (M = .06, SD = .02). From 167 published
article instances, I identified 42 substantive plagiarism-of-others SSIs (M = .08, SD = .05)
and 125 substantive plagiarism-of-self SSIs (M = .12, SD = .09).

These figures lead to an important finding that my study identified. By far, after all

the false positives were removed, the majority of the individual SSIs were 4% or less.

Very few dissertations or published articles had substantive SSIs. Thus, the absence of

substantive SSIs leads one to conclude that most plagiarism issues are an

accumulation of small infractions.

Secondary findings from this study were derived from the document metadata in

each corpus. Publishing year, research methods, author counts, word counts, and

reference counts provided good statistics that institutions and publishers could use to

research and establish submission expectations. A breakdown of the documented

publishing years indicated that publishing years were well represented in HRD related

dissertations and published articles (see Table 6 and Table 11). However, one should

understand that the publishing year does not always represent the year of the study or

the date of authorship.

It was no surprise that these corpora predominantly employed quantitative and

qualitative research methods. However, while low, dissertations as compared with

published articles employed almost double the number of mixed-method research

designs. Moreover, published articles had a significant number of other (non-empirical)

articles (see Appendix A and B for the breakdown between each DSL). One should also

note that the quantitative research method designs had higher mean aDSI rates in the

corpora (see Tables 6 and 11).

Dissertations always had a single author, whereas this sample of published

articles listed author counts ranging between 1 and 18. With these results, this study

could only utilize author count as an independent variable for the published article corpus.

While I have previously noted that dissertations had almost six times the number of

words that published articles had in the study’s data, one should note that dissertations

occasionally contained sizeable appendices, which added to the word count.

Published articles had more references per 1,000 words than dissertations.
The dissertations had a mean of 48,603 words and a mean of 140 references, whereas
the published articles had a mean of 8,552 words and a mean of 46 references. This gives
dissertations 2.88 references per 1,000 words (r = .53) and published articles
5.38 references per 1,000 words (r = .77).
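The references-per-1,000-words figures follow directly from the corpus means. The sketch below reproduces that arithmetic. The function name `refs_per_1000_words` is an illustrative assumption, and the calculation is a ratio of means, which is not necessarily the same as the mean of each document's own ratio.

```python
def refs_per_1000_words(mean_references, mean_words):
    """Ratio of mean reference count to mean word count, scaled to 1,000 words."""
    return mean_references / mean_words * 1000


dissertation_rate = refs_per_1000_words(140, 48603)
article_rate = refs_per_1000_words(46, 8552)

print(f"Dissertations:      {dissertation_rate:.2f} references per 1,000 words")
print(f"Published articles: {article_rate:.2f} references per 1,000 words")
```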

As previously discussed, this study defined a synthesized set of DSLs using

various prior classification levels as a foundation. Derived from Thomas and de Bruin

(2014), the Higher Education Commission, Pakistan (n.d.), and others (see Table 2),

these levels in Tables 6 and 11 identify the document counts, mean aDSI, and mean

aDSW in each of the three DSLs for dissertations and published articles.
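As a minimal sketch of how an aDSI could be assigned to one of the three DSLs (low 0%-14%, high 15%-24%, excessive 25%-100%), consider the function below. The function name, the handling of values that fall between whole percentages (for example, .145 is treated as low), and the sample values are all assumptions made for illustration. They are not data from either corpus.

```python
from collections import Counter


def document_similarity_level(adsi):
    """Map an adjusted document similarity index (0.0 to 1.0) to a DSL label.

    Assumption: the cut points are applied as adsi < .15 (low),
    adsi < .25 (high), otherwise excessive.
    """
    if not 0.0 <= adsi <= 1.0:
        raise ValueError("aDSI must be between 0 and 1")
    if adsi < 0.15:
        return "low"
    if adsi < 0.25:
        return "high"
    return "excessive"


# Hypothetical aDSI values, for illustration only.
sample_adsi = [0.02, 0.09, 0.14, 0.15, 0.24, 0.25, 0.61]
membership = Counter(document_similarity_level(value) for value in sample_adsi)
print(membership)
```

A tally of this kind yields the document counts per DSL. The mean aDSI and mean aDSW per level would be computed over the same groups.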

This study examined a corpus of 360 dissertations using Turnitin and returned

88.1% of the dissertations in the low levels of plagiarism, 9.7% in the high level and

2.2% in the excessive level. Table 14 compares the results of this study to several other

important dissertation plagiarism studies.

Table 14

Dissertation Corpus Plagiarism Studies using Turnitin

                                                                             Low      High     Excessive
Authors          Year  Corpus                                  n    aDSI (SD)  0%-14%   15%-24%  25%-100%

Mayes (Current)  2017  Online & Traditional HRD Dissertations  360  .09 (.06)  88.1%    9.7%     2.2%
Isonᵃ            2014  Traditional Dissertations               184  .15 (.13)  54.0%    28.0%    18.0%
Isonᵃ            2014  Online Dissertations                    184  .14 (.08)  54.0%    36.0%    10.0%
Isonᵃ            2012  Online Dissertations                    100  .15 (.13)  52.0%    34.0%    14.0%

ᵃ I estimated DSL membership using interpolation techniques.

Ison (2012) studied 100 dissertations retrieved from ProQuest with publication

years from 2009 to 2011. Using Turnitin to analyze the dissertations for evidence of

plagiarism, he applied his data to Bretag and Mahmud’s (2009) similarity levels. His

analysis identified 40% of the dissertations with no or low plagiarism (DSI of 0 to .10)

and 46% of the dissertations with medium plagiarism (.11-.24). He also found that 11%

of the dissertations had high plagiarism (DSI of .25-.49) and 3% (DSI of .50 and above)

had excessive plagiarism.

Furthering his dissertation plagiarism research, Ison (2014) examined a corpus of

368 dissertations published between 2009 and 2013 for evidence of plagiarism. He

compared 184 dissertations from predominantly on-line programs to 184 dissertations

from traditional institutions again using Bretag and Mahmud (2009) similarity levels. In

both Ison studies, there were limited details about Turnitin configurations and about
false positive exclusions.

The results show a sizeable difference between Ison’s and my study’s results.

Ison's results are consistent across all three corpora and are a good indication that

consistent methodologies produce consistent results. Ison described the procedures he

used in both of his studies to avoid or remove false positives:

Quotations and definitions were manually omitted from the analysis. Results

were examined for potential similarity overlaps with work previously submitted by

the author at the institution that awarded the doctorate. Such overlaps were then

excluded from the final similarity score used in this study (2012, p. 231-232).

Quotations, bibliographies, and definitions were omitted from the analysis. The

initial similarity indices were examined for potential overlaps with work previously

submitted by the author at the institution that awarded the doctorate and were

subsequently removed if applicable. (2014, p.276)

The important finding from my study’s dissertation descriptive statistics, when

compared to Ison's dissertation descriptive statistics, was that identifying the plagiaristic

activity within a corpus of dissertations is merely a snapshot and highly dependent upon

the various configurations and methodologies employed to reduce the false positives.

My study also examined a corpus of 360 published articles using Turnitin and returned

79.2% of the published articles in the low levels of plagiarism, 12.8% in the high level

and 8.1% in the excessive level. Table 15 compares the results of this study to several

other important studies on published article plagiarism.

Table 15

Published Articles Corpus Plagiarism Studies using Turnitin

                                                                                    Low      High     Excessive
Authors              Year  Corpus                                     N    aDSI (SD)  0%-14%   15%-24%  25%-100%

Mayes (Current)      2017  Published HRD Articles                     360  .11 (.10)  79.2%    12.8%    8.1%
Thomas & de Bruinᵃ   2014  South African Management journal articles  371  .17 (.12)  51.5%    27.2%    21.3%
Sunᵃ                 2013  Published STEM Articles                    300  .13 (.12)  64%      31%      5%
Sunᵃ                 2013  Published Social Sciences Articles         300  .08 (.08)  Breakdowns between STEM and Social Science articles were not published.
Honig & Bedi         2012  Presentation Papers                        279  No comparable descriptive statistical percentages. The study used word counts only.

ᵃ I estimated DSL membership using interpolation techniques.

Thomas and de Bruin (2014) submitted a corpus of peer-reviewed articles

(N=371) to Turnitin for COA analysis. They reported corpus descriptive statistics that

found 51.5% of the published articles were members in the low group. The high

plagiarism group had 27.2% membership, and the excessive plagiarism group had

21.3% membership. These researchers provided a detailed description of their

methodology for removing false positives:

The results for each article were checked twice and a conservative approach was

adopted in the interpretation of the similarity indices, in which the benefit of doubt

was in favour of the authors. For each article, the following content was not

included in the assessment of similarity: bibliography/list of references,

quotations, strings of words of less than 10, student write-ups on which the article

was based, conference proceedings and abstracts detailing the main features of

the article. In addition, during the second inspection of the data, specific

methodological terms and statistical or mathematical formulae were excluded in

the analysis of similarity. (Thomas & de Bruin, 2014, p.3)

Sun (2013) examined 600 articles, also using Turnitin, and reported membership

in six different DSLs. Sun excluded all similarities less than 30 words to avoid trivial

similarity issues. I experimented with Sun's 30-word exclusion algorithm and found it

greatly reduced the number of paraphrasing issues, which could have been counted as

plagiarism. Sun further described the method used to remove additional false positives:

A manual check was employed by the researcher of this study to make

qualitative judgments on the appropriateness of textual re-use. In the current

study, human screening of each match was conducted across 4247 matches and

the following types of matches were excluded from the results: (1) the exact

article found in the Turnitin database, (2) text in quotation marks or displayed

quotations, (3) formulae and terminology, and (4) article titles and author

information. (2013, p. 267)
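The effect of a minimum-length exclusion, such as Sun's 30-word cut-off, can be sketched as a simple filter over matched strings. This is an illustration of the idea only and does not describe how Turnitin computes its indices. The `Match` class, the function names, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One matched string in a similarity report (hypothetical structure)."""
    source: str
    words: int


def exclude_short_matches(matches, min_words=30):
    """Keep only matches with at least min_words words."""
    return [m for m in matches if m.words >= min_words]


def similarity_index(matches, document_words):
    """Share of the document's words covered by the given matches."""
    return sum(m.words for m in matches) / document_words


# Hypothetical matches in a hypothetical 5,000-word article.
matches = [Match("source A", 6), Match("source B", 12),
           Match("source C", 45), Match("source D", 30)]
document_words = 5000

kept = exclude_short_matches(matches, min_words=30)
index_before = similarity_index(matches, document_words)
index_after = similarity_index(kept, document_words)

print(f"Before exclusion: {index_before:.4f} ({len(matches)} matches)")
print(f"After exclusion:  {index_after:.4f} ({len(kept)} matches)")
```

A higher cut-off removes more trivial matches, but it also removes sentence-length matches, so the choice of threshold changes what the resulting index measures.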

Honig and Bedi (2012) examined 279 papers by 636 authors who presented at

the International Management Division of the 2009 Academy of Management

conference. In describing the methodology for removing false positives, they stated:

“We manually checked the highlighted sections for appropriate citations and excluded

the methodology section of all empirical papers” (pp.112-113). They reported that

25.44% of the corpus (71 papers) they reviewed had some level of plagiarism and that

13.6% of the corpus (38 papers) had significant levels (5% or more of plagiarized

content).

My attempt to compare my study with these studies demonstrated that there is a

lack of conformity and continuity between all COA corpus studies. However, the current

accumulation of plagiarism research affirms that both the incidence of plagiarism and

the method of researching plagiarism are serious issues, which need addressing.

Differential Findings

Software like Turnitin is best suited for detecting potential plagiarism on

unpublished manuscripts. That being said, using plagiarism detection software on

previously published corpora, while requiring difficult and time-consuming processes,

can be effective. However, finding and excluding false positives are crucial to the

accuracy of a COA. While this process is person-power intensive, if COA results are to

be meaningful (valid and replicable) the process must be well defined and executed with

sufficient adherence to the process. However, there is little in the literature that provides

definitive instruction. Turnitin indicated that COA reports are just a starting point, and

that investigator intervention is required. They provided no statistics that expressed the

importance of the intervention or how the investigator should execute the intervention.

An important finding from this study is that there is strong statistical support for

the removal of false positives in Turnitin COA reports. My research determined that the

differences between the initial DSI generated by Turnitin and the aDSI I calculated are

statistically and practically significant. The Wilcoxon signed-rank statistics revealed a
statistically robust and practically significant difference between the before (DSI) and
after (aDSI) variables in the corpus of dissertations, p < .05, with a large Wilcoxon
signed-rank effect size (r = .52), and in the corpus of published articles, p < .05, with
a large Wilcoxon signed-rank effect size (r = .58). The findings

from my study’s differential statistic tests demonstrated that for both corpora there was

a statistically and practically significant need to investigate and remove false positive

SSIs from the COAs. Bypassing this process on either corpus would render COA results

invalid and meaningless. These results confirmed the need for the removal of false

positives.
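For readers who want to see how a before-and-after comparison of this kind can be computed, the sketch below implements a Wilcoxon signed-rank z statistic with the normal approximation and converts it to an effect size r = |z| / sqrt(N). Several points are assumptions rather than a description of this study's software or settings: zero differences are dropped, no tie or continuity correction is applied, N is taken as the total number of observations (two per document), and the paired values are hypothetical whole-number percentages.

```python
import math


def wilcoxon_signed_rank_z(before, after):
    """Return (z, n_nonzero) for paired samples using the normal approximation.

    Zero differences are dropped and tied absolute differences receive
    average ranks. No tie or continuity correction is applied.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean_w) / sd_w, n


def effect_size_r(z, n_observations):
    """Effect size r = |z| / sqrt(N); the choice of N is an assumption."""
    return abs(z) / math.sqrt(n_observations)


# Hypothetical DSI (before) and aDSI (after) values in whole percentages.
dsi_before = [35, 22, 18, 91, 12, 40, 27, 15, 10]
adsi_after = [9, 11, 10, 6, 12, 14, 13, 8, 12]

z, n_nonzero = wilcoxon_signed_rank_z(dsi_before, adsi_after)
r = effect_size_r(z, 2 * len(dsi_before))
print(f"z = {z:.3f}, non-zero pairs = {n_nonzero}, r = {r:.2f}")
```

By a commonly cited convention, r values of about .5 or more are described as large, which is the sense in which the effect sizes above are called large.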

In support, Batane (2010) reported that he found when using Turnitin, the COA

report suffered from a “tendency of the software to identify material as plagiarized” (p.

3). He suggested that Turnitin users verify all instances of identified content similarities

and make the necessary adjustments. Braumoeller and Gaines (2001) in a study using

a software called EVE, noted that false positives were an issue that clouded their

plagiarism detection analysis. They found that almost 50% of the papers flagged, had

been properly cited and referenced. Sun (2013) in her study seemed to reduce the

issue of false positives by using an exclusion for similarities of less than an arbitrary
30-word cut-off point. Sun concluded that “the quantitative findings of the current study
indicate that authors in contexts wherein English is an official language do not differ
significantly from their counterparts on their Turnitin scores or the number of 30-word or
longer strings of successive matching text from self-published articles and self- and-others’
publications combined” (p. 270). Recounting my experience testing a 30-word
exclusion, I found that this configuration eliminated many sentence-size instances of

paraphrased plagiarism. Jocoy and DiBiase (2006) performed manual document checks

to identify false positives and used an aDSI for their analysis. Martin, Rao, and Sloan

(2011) individually removed all cited, quoted, and bibliographic references and manually

checked for potential false positives.

Honig and Bedi (2012) reported that, to reduce the number of false positives,
they manually checked the highlighted areas within the manuscripts.

Moreover, they excluded the methodology and reference sections of all empirical

papers. Regarding plagiarism-of-self, they stated:

Since the focus of this study was on individuals plagiarizing others’ work without

appropriate acknowledgment, we adopted a more conservative approach toward

self-plagiarism. If authors used sections from their own previous work or cited the

primary source, then it was not considered plagiarism. (p.112)

Prediction Findings

My study focused upon five predictors derived from the available metadata

gathered from the two corpora. Being that participation in plagiarism activity is a

personal decision, I did not expect any strong prediction based upon document

metadata. However, this study did find that the dissertation and published article metadata
provided some evidence of plagiarism prediction. Using multinomial logistic regression
statistics, I determined that the optimal model Nagelkerke pseudo R² for the
dissertations was .169 and for the published articles was .095. Moreover, the
independent variables that had statistically significant values (odds ratios) were word
counts and reference counts. Specifically, increases in reference counts predicted increases
in plagiarism, while increases in word counts predicted reductions in plagiarism.
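For context on the pseudo R² values, the sketch below shows how the Nagelkerke statistic is conventionally derived from the log-likelihoods of an intercept-only model and a fitted model, alongside the Cox and Snell and McFadden statistics. The function name is an illustrative assumption, and the log-likelihood values are hypothetical. They are not taken from this study's models.

```python
import math


def pseudo_r_squared(ll_null, ll_model, n):
    """Return Cox and Snell, Nagelkerke, and McFadden pseudo R-squared values.

    ll_null:  log-likelihood of the intercept-only model
    ll_model: log-likelihood of the fitted model
    n:        number of cases
    """
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cox_snell = 1 - math.exp(2 * ll_null / n)  # upper bound of Cox and Snell
    return {
        "cox_snell": cox_snell,
        "nagelkerke": cox_snell / max_cox_snell,
        "mcfadden": 1 - ll_model / ll_null,
    }


# Hypothetical log-likelihoods for a sample of 360 documents.
result = pseudo_r_squared(ll_null=-250.0, ll_model=-240.0, n=360)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Nagelkerke rescales the Cox and Snell value by its maximum attainable value, so it is always the larger of the two for the same model.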

Most of the other plagiarism prediction research was based upon controlled

interventions, human demographics, topic knowledge, and situations. Braumoeller and

Gaines (2001) used a software called EVE to conduct a plagiarism study using student

assignments with and without student warnings explaining plagiarism detection. While

they concluded that “The results of plagiarism tests should not be taken to be definitive”

(p. 836), they also stated that “At this stage, plagiarism-detection software is useful” (p.

836). While obtaining inconclusive plagiarism statistics, they predicted that just warning

students about plagiarism detection seemed to make students more diligent in avoiding

plagiarism activities.

Jocoy and DiBiase (2006) also attempted to predict the outcomes from a student

intervention but used Turnitin for plagiarism detection instead of EVE. Having measured

plagiarism rates on their students’ first assignment without the students having any

knowledge of Turnitin, they then provided a Turnitin demonstration and warned about

the consequences of plagiarism. When Jocoy and DiBiase evaluated the second

assignment, they suggested that they found a statistically insignificant, but measurable

drop in plagiarism rates.

Ling (2006) researched the premise that L1 (English as the first language) and

L2 (English as a second language) students had different perceptions regarding

plagiarism. Having conducted semi-structured interviews with 46 participants, she

concluded that the results were mixed. She also found that citation experiences varied

from participant to participant and that subjects from various cultures believed that

words are to be shared by all and not owned. Moreover, her research found that L2

students viewed the science of citations, paraphrasing, and plagiarism as hurdles to

their authoring skills.

Hege (2008) found that the affective state of mind could predict the ability to

recall an idea’s source. She applied her research to plagiarism and implied that those

authors in a good mood are more likely to forget the source of the ideas they are using.

Thus, accurate citing might be impaired.

One of the most interesting studies on prediction is Batane’s (2010) student

intervention study; it is similar to the Jocoy and DiBiase (2006) study. Measuring COA

rates with Turnitin on the first assignment from 272 students who were unaware that

Turnitin was being used to check their work, the students’ mean aDSI was reported at

.20. Assignments with evidence of plagiarism were downgraded and returned to the

students. After introducing the students to Turnitin and stating they were using the

software to check their second assignment, the reported mean aDSI was reduced by

.04 to .16. Moreover, in a category measure used by Batane, labeled as legitimate

research, measured at an aDSI of zero, the percentage of documents at that level rose

from 14.4% to 52.1%.

Martin, Rao, and Sloan (2011) studied 158 students using one specific

assignment submitted by each student to Turnitin. The focus of their study was to

predict plagiarism based upon participant demographic data. They used ethnic markers

for Caucasian and Asian students. They found no differences in predicting plagiarism

that ethnicity could explain. However, heritage or acculturation or the time spent in a

culture did show some linkage to plagiarism. Using a MANOVA multivariate test (Wilks'
statistic), they identified an overall moderate effect size of η² = .11 (p = .037).

Honig and Bedi (2012) had access to several author related demographics in the

course of a submission process at the 2009 Academy of Management conference. They

used this information in an examination of plagiarism prediction across 279 documents.

They wanted to know if plagiarism was affected by an author being a tenured or senior

scholar, by L1 or L2 authors, by an author’s country, by an author’s country being

established or matured, and finally by an author’s gender. They found no statistical

support for prediction concerning an author being a tenured or senior scholar or the

author’s gender. However, they did find statistical support for predictions for gender as a

moderating force with an author’s country being newly established or less matured

(more plagiarism), and between authors with L1 or L2 (more plagiarism) backgrounds.

Sun (2013) configured Turnitin for a 30-word exclusion for checking content

similarities and found that the number of authors did influence plagiarism rates (the

single author was the lowest). With her configuration at ten times the default Turnitin

setting, she found that an author’s official language did not influence plagiarism rates.

Ison (2012, 2014) conducted dissertation corpus studies and explored the premise that
dissertations from on-line institutions would have higher plagiarism rates than those from
more conventional institutions. He did not find any evidence that supported this hypothesis.

Given the various, sometimes conflicting evidence of prediction, it remains clear

that better, more defined techniques need to be developed. Plagiarism corpus studies

are very complex. The slightest variance in strategies and techniques can have a huge

impact on the reported results. However, the constant that seems to reappear is that

educating the student or author about plagiarism and its detection does influence

outcomes.

Discuss Document Similarity Levels

Excessive Levels of Plagiarism

Across the corpora of 360 dissertations and 360 published articles, this study

found 37 documents in the excessive DSL (25% - 100%). The ten
most plagiarized documents at this level were published articles where

large blocks of content came from prior articles or book chapters from the same

author(s). Often the abstracts were identical, and the titles were similar. Sometimes the

author lists varied. For instance, three authors instead of two, or authors listed in

different orders were common variances. The articles were normally published by

different publishers. The articles appeared to have been submitted to different journals
around the same time, but more often than not, the final acceptance dates were
different.

Another common type of plagiarism-of-self at this level was covering the same

topic using different approaches. Occasionally, I found an article proposing a theory and

design for a particular type of study, then another article for the presentation of the

completed study, then a follow-up article using the same data but employing group

breakouts. Also, there were authors who published additional articles that discussed the

validity of the prior studies. Most of the articles at this level also had some serious

paraphrasing issues (plagiarism-of-others), but by far plagiarism-of-self issues

dominated the articles at the excessive level. While there are valid reasons for creating

a series of articles, it is critical that the readership is aware of the potential duplications

and that publishers have granted permissions to revisit prior published research or

articles.

Dissertations belonging to the excessive DSL had serious paraphrasing issues in

plagiarism-of-others and no plagiarism-of-self complications. Turnitin identified

paragraph after paragraph, borrowed from other authors, where the researcher had

changed only a few words. The number of SSIs was often as high as 200. However, the

amount each SSI contributed was normally less than the 5% threshold.

I did find one instance where a Ph.D. candidate had authored a published article

and plagiarized it in their dissertation that followed. This SSI was calculated at 4%. As

more and more doctoral candidates publish, this phenomenon may become more

common. At the excessive DSL, published articles and dissertations would need major
rewriting; the article would probably be rejected, and the candidate would probably fail.
Furthermore, the ethical

implications should be investigated by the publisher, or in the case of a dissertation, the

institution.

High Levels of Plagiarism

Across the corpora of 360 dissertations and 360 published articles, this study

found 81 documents in the high DSL (15% - 24%). Published articles and dissertations

equally dominated membership in the high DSL. Paraphrasing issues were the

dominant problem. Having sentence size blocks of text without double quotes remained

a serious issue with all documents at this DSL. More often than not, the content

similarities were cited, but there are no double quote marks or indented paragraphs that

indicated these were not the author’s words. Several of the articles used single quote

marks at this level causing Turnitin to identify the content in the SSI and DSI. However,

there were still many instances of paraphrasing that influenced the DSI.

At the high DSL, it is impractical for published articles and dissertations to be

corrected without major rewrites. The author would be faced with rewriting the
manuscript paragraph by paragraph, adding or retaining required citations and,
possibly, quotation marks. An author who does not want to improve their attempt at

paraphrasing would have to consider whether putting so much of the document in

double quotes would leave a reader wondering what was original. However, this is a

chance for institutions and publishers to influence an author’s research and writing

habits. Providing guidance at this point could correct future issues and have an effect on

the quality of research being published. As I was examining one high-DSL article, I
discovered that the author had not plagiarized. He had written a paper he published in

the ERIC database and had borrowed part of it for a recent article. According to ERIC's

(2017) submission policies:

When you contribute your work to ERIC, you grant permission to index the

material and disseminate it online. You do not transfer copyright to ERIC and

may seek publication. (n.p)

When I excluded that source, it descended into the low DSL. This illustrates the due

diligence an investigator must possess to evaluate each instance of similarity.

Low Levels of Plagiarism

Of the 720 documents across the two corpora, 602 belonged to the low DSL. The reader
may find it interesting that many of the published articles and dissertations in the
lowest levels of plagiarism

(0% - 14%) had some of the highest starting Turnitin DSIs. I commonly found that the

first one or two exclusions (false positives) removed almost all of the reported content

similarities. In several cases, dissertations from online institutions had been submitted

to Turnitin repeatedly by several committee members at various institutions.

Turnitin often detected these duplicates and reported DSIs well over .9. The seriousness

of this issue is visible in the histograms, where the phenomenon created the

starting bimodal distributions (see Figures 9 and 12).

I found that it was common for short essays and editorials in our random sample

of published articles to have low levels of content similarities. While the ideas may not

be solely the author’s, the content was written in their words. Often there were a limited

number of citations and references. Overall, having 602 out of 720 documents in the low

level is a very positive outcome for the field of HRD.

Issues and Obstacles

I want to re-emphasize that Turnitin was not engineered for corpus wide COA

studies using published documents. However, this study will include discussion points

concerning potential feature changes or additions. That being said, many of the corpus

COA issues in this study seemed to be from inadequacies in the understanding of the

Turnitin software design and use. A Google search for a “Turnitin technical manual”

produced 415,000 results starting with “Getting Started, Student Manual, Instructor

Manual” and “User Manual.” Little of this material explained in technical terms what

Turnitin does, or how and why it does it.

For instance, in this study, I chose to bypass all similarity groups consisting of 10

words or fewer. A Turnitin anomaly may be that the 10-word exemption does not always

work. Figure 19 shows an example where Turnitin flagged six words for inclusion in the DSI.

This content is trivial in nature and, at six words, should have fallen under the

10-word exclusion.

A common Turnitin setting removes quoted content from the DSI calculations.

However, as previously stated, this study found published papers where single quotes

were used for quotations, so Turnitin counted those direct quotations in the DSI.

Moreover, I found that DSI values of .01 - .02 were often a re-publishing of the title,

abstract, and table of contents (TOC, dissertations only) across many different

marketing or reference sources. These marketing republishings made it a monumental

task to exclude those SSI that referred to titles, abstracts, and tables of contents.

Figure 19. Trivial similarities of six words using a 10-word exemption.

Another challenge confronting corpus COA studies is how to address layered or

overlapping similarities. These are similarities that exist across a multitude of

documents (Tucci & Galwankar, 2011). Several times, I had to delete a multitude of

SSIs, one at a time, just to reduce a DSI by .01 or .02. As I excluded these SSIs,

Turnitin produced more SSI for the submitted document with the same false positives.

These similarities were often trivial things like government agency names and

addresses, or institutional or publisher contact information, survey identification, and

even topic names. Even more troubling, an SSI may combine valid similarities

and false positives. Additionally, an SSI can include both plagiarism-of-others and

plagiarism-of-self. Turnitin has no apparent means to internally remove part of an SSI or

separate out and document those differences.

Often when the DSI was .90 or larger, the investigation pointed to duplication of

documents. However, I found several documents (primarily published articles) where

the only readable text was a standardized copyright notice used by a large

publisher. The main part of the document was a scanned image. Turnitin reported a

1.00 DSI because the copyright notice, which had been published in hundreds, if not

thousands, of other documents, was the only part of the document Turnitin could read. I

employed Adobe Acrobat’s image-to-text OCR feature to resolve this type of issue

before resubmitting the document to Turnitin.

During the dissertation corpus COA, I observed that most of the dissertations

were present in full or in part on the Internet at various locations. To name a few,

Internet URLs such as gradworks.umi.com, search.proquest.com, media.proquest.com,

www.coursehero.com, and www.researchgate.com provided copies of the same

documents or abstracts and links to the same document. I commonly found that .01 to

.02 of the DSI was linked to gradworks.umi.com, search.proquest.com, or

media.proquest.com and reflected title or abstract listings rather than instances of

plagiarism. The use of Turnitin at multiple institutions created large similarity issues.

However, often these issues stemmed from the same reviewer working at different

institutions repeatedly reviewing the same documents.

Furthermore, while reviewing the dissertations, it was possible to find similar, if

not identical dissertations under two different author names. Upon careful examination, I

realized the issue was not always a plagiarism issue because only the last name had

changed. The preponderance of the evidence suggested the change was a result of

marriage, divorce, or other legal name change event.

One of the dissertations had most of the references highlighted as Turnitin

content similarities. There was no internal way to exclude those Turnitin false positives.

I concluded that the attached appendices may have caused Turnitin to include the

reference list in its COA report despite its configurations instructing it to ignore reference

lists. Based upon a 100-page count for content and nine pages of references, I

estimated that by eliminating the anomaly, the aDSI would be reduced by .10, still

leaving the dissertation membership in the excessive level. Another dissertation was

authored with references included in each section (multiple article format). Again, the

format confused Turnitin, leaving it to identify the reference lists as content similarities

with extensive overlapping. To avoid these issues, it would be best to delete reference

sections before a Turnitin submission.

A minor issue that I need to mention is that sometimes an article shares pages

with the beginning or end of other articles. These submissions to Turnitin produced a

COA report on all of the submitted text. An investigator would be required to separate

out the similarities generated by the pieces of the adjoining articles.

While this study did not include the Turnitin student DCD, I

performed a preliminary examination of its functionality. Reported SSIs tagged as

Student papers were most difficult to confirm as infractions. Turnitin does not make the

student papers available for inspection without permission from the person who had

submitted them to Turnitin. Moreover, when I obtained the requested permission, I found

that students/authors often reused parts of their own previously submitted university

assignments in their dissertations. This practice is not plagiarism-of-self if they own the

copyright to their unpublished works. However, it may be against their university policy,

which is beyond the scope of this study. Furthermore, this practice is normal, as many

students are encouraged in their academic endeavors to select an area in their studies

and develop their expertise by authoring assigned essays and literature reviews.

Moreover, I found several examples where students had attended multiple schools and

used their prior research and writings repeatedly, thus creating many content similarity

issues. If the student or instructor had submitted these documents to Turnitin, it would

place the content in the separate student DCD, thus increasing the number of SSIs.

A serious logistics issue that affects corpus COA studies is the ability to track the

steps completed and record the plethora of resulting data. Spreadsheets are barely

adequate for tracking the processes and collected data. Researchers should use

database management skills in corpus COA studies. Modern database management

systems can build user data input screens by “drag and drop” and can

query and export data from multiple tables in a database. This functionality

exceeds spreadsheet applications’ capabilities. However, database competencies are

skills a researcher may find difficult to learn.
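
To make the database suggestion concrete, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and helper functions are hypothetical illustrations, not the tracking system used in this study. The sketch shows how two related tables let an investigator log each exclusion once and then query totals per document, which is cumbersome in a spreadsheet.

```python
import sqlite3

# Hypothetical schema for illustration only; these table and column names
# are assumptions and are not taken from the study's own tracking files.
SCHEMA = """
CREATE TABLE document (
    doc_nbr   INTEGER PRIMARY KEY,
    doc_type  INTEGER NOT NULL,   -- 1 = dissertation, 2 = published article
    gross_dsi REAL NOT NULL       -- Turnitin-reported DSI before exclusions
);
CREATE TABLE exclusion (
    id      INTEGER PRIMARY KEY,
    doc_nbr INTEGER NOT NULL REFERENCES document(doc_nbr),
    source  TEXT NOT NULL,        -- similarity source that was excluded
    ssi     REAL NOT NULL,        -- source similarity index that was removed
    reason  TEXT NOT NULL         -- why it was judged a false positive
);
"""


def open_tracking_db(path=":memory:"):
    """Create the tracking tables in a new SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_exclusion(conn, doc_nbr, source, ssi, reason):
    """Record one excluded similarity source for a document."""
    conn.execute(
        "INSERT INTO exclusion (doc_nbr, source, ssi, reason) VALUES (?, ?, ?, ?)",
        (doc_nbr, source, ssi, reason),
    )


def exclusion_summary(conn):
    """Return (doc_nbr, gross_dsi, exclusion_count, total_excluded_ssi) per document."""
    return conn.execute(
        """
        SELECT d.doc_nbr, d.gross_dsi, COUNT(e.id), COALESCE(SUM(e.ssi), 0.0)
        FROM document AS d
        LEFT JOIN exclusion AS e ON e.doc_nbr = d.doc_nbr
        GROUP BY d.doc_nbr
        ORDER BY d.doc_nbr
        """
    ).fetchall()
```

The LEFT JOIN keeps documents that have no exclusions in the summary, so the investigator can see at a glance which documents still await review.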

That being said, I posit that software such as Turnitin should have the ability to

organize by project and save the corpus documents and reports, track all of the steps in

the process, save and display additional data that is critical to a COA, and provide an

accommodating query system.

Conclusions

While a review of the literature confirmed that plagiarism has been the focus of

many studies, ongoing research is still required. The literature does not indicate that

plagiarism levels have significantly dropped over the years. In fact, with the Internet,

plagiarism is incredibly easy and a time saver for both students and authors.

In response, this exploratory study documented a content originality analysis

(COA) of HRD-focused corpora of dissertations and published articles. Using Turnitin

COA software, this study identified and analyzed potential plagiaristic activity. Moreover,

this study documented the process of data collection and data analysis including

validating investigator-made adjustments to the initial Turnitin-reported results (see

Figures 2 and 8).

This study expanded the scope of COA descriptive statistics by using Turnitin-

reported variables, researcher-derived variables, and document metadata such as the

year of publication, research method, author counts, word counts, reference counts, and

subpopulation variables when applicable. However, I posit that the most important

descriptive statistic in a corpus plagiarism study is the DSL group membership based

upon the aDSI. A three-level DSL used in this study was engineered upon a foundation

of prior research studies. The aDSI values assigned membership in the low, high, or

excessive levels of the DSL. This study examined 360 dissertations and 360 published

articles and returned 88.1% of the dissertations in the low level of plagiarism, 9.7% in

the high and 2.2% in the excessive level, while 79.2% of the published articles were in

the low level of plagiarism, 12.8% in the high and 8.1% in the excessive level. The

differences in the rates between the two corpora can primarily be explained by the

incidences of plagiarism-of-self that were absent in the dissertations and present in the

published articles. Upon further examination, the aDSI means for the dissertations

versus the published articles in the low DSL (level 1) were .07 versus .08 (a positive 14%

difference). In the high DSL (level 2), they were .18 versus .19 (a positive 5% difference). In the

excessive DSL (level 3), they were .30 versus .36 (a positive 20% difference). The conclusion

that one can draw from the descriptive results of corpus plagiarism studies is that

these statistics alert us to ongoing plagiaristic activity within academic institutions,

publishers, and specific knowledge domains.
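
The assignment of DSL membership from the aDSI can be sketched as follows. This is an illustration only: the level ranges (0%-14%, 15%-24%, 25%-100%) and the level codes (1 = low, 2 = high, 3 = excessive) come from this study, while the function names and the treatment of values that fall between the whole-percent boundaries (for example, .145 counted as low) are assumptions.

```python
def dsl_level(adsi):
    """Map an adjusted DSI (a proportion from 0 to 1) to a DSL level code.

    Assumed cut points: below .15 is low, below .25 is high, otherwise excessive.
    """
    if not 0.0 <= adsi <= 1.0:
        raise ValueError("aDSI must be a proportion between 0 and 1")
    if adsi < 0.15:
        return 1  # low (0%-14%)
    if adsi < 0.25:
        return 2  # high (15%-24%)
    return 3      # excessive (25%-100%)


def level_shares(adsi_values):
    """Return the share of documents in each DSL level as a dict keyed by level code."""
    levels = [dsl_level(value) for value in adsi_values]
    if not levels:
        raise ValueError("at least one aDSI value is required")
    return {level: levels.count(level) / len(levels) for level in (1, 2, 3)}
```

Applied to a corpus, `level_shares` yields the kind of group-membership percentages reported above for the two corpora.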

While I found little evidence that prior research had statistically identified the

importance of detecting and documenting false positives, the handling of false positives

proved critical to the results any investigator uncovered. For example, Dee and Jacob

(2012) analyzed 1279 student submissions. After a preliminary examination, they

reported 50% of the assignments had false positives. However, as with many studies I

reviewed, there were few details about the processes used for detecting these false

positives, and no assessment of the statistical and practical significance of the role the

identified false positives played. To define a process left undefined in studies such as

Dee and Jacob's (2012) work, I provided detailed process descriptions and evidence

of statistical and practical significance in defining the difference between the Turnitin

DSI and my aDSI. The dissertations started with a DSI of 27% and ended with an aDSI

of 9%, an 18-percentage-point difference. The published articles started with a DSI of 67% and ended

with an aDSI of 11%, a 56-percentage-point difference. One must conclude that an investigator

adjustment process is just too important to omit. Moreover, measuring its statistical and

practical significance adds an important metric to measure the accuracy of COA

studies.
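
The two differences reported above are differences in percentage points. As a small worked sketch, the function below (a hypothetical helper, not part of the study's syntax files) computes both the percentage-point difference and the relative reduction, using the DSI and aDSI values stated in the text as inputs.

```python
def adjustment_effect(dsi_pct, adsi_pct):
    """Return (percentage-point difference, relative reduction in percent).

    dsi_pct is the Turnitin-reported DSI and adsi_pct the adjusted DSI,
    both expressed as percentages (for example, 27 and 9).
    """
    if dsi_pct <= 0:
        raise ValueError("the starting DSI must be greater than zero")
    points = dsi_pct - adsi_pct
    relative = 100.0 * points / dsi_pct
    return points, relative


# Values reported in the text for the two corpora.
dissertation_points, dissertation_relative = adjustment_effect(27, 9)
article_points, article_relative = adjustment_effect(67, 11)
```

Under this calculation the investigator adjustments removed roughly two-thirds of the reported similarity for the dissertations and more than four-fifths for the published articles.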

Another important part of this study was identifying the difficulty of plagiarism

prediction. Using a multinomial logistic regression, I found that the number of document

references was a statistically significant predictor of DSL membership.

The interpretation of this finding is that

instructors and editors can expect documents with high reference counts to have

more plagiarism issues. My research demonstrated that higher reference counts often

mask numerous cited paraphrasing issues lacking quotation marks.

The experience I gained from this study and the review of the literature lead

me to believe that plagiarism prediction is still in its infancy. However, as our society

collects more and more data on individuals, there exists the potential to more accurately

predict plagiaristic activity based upon what may seem like unrelated data.

Implications

Academic and Publishing Domains

There is an overwhelming need for the academic and the publishing professions

to develop a set of best-practice strategies for use in future corpus plagiarism studies.

Zhang and Jia (2012) surveyed editors on what they perceived as troubling levels of

plagiarism and found that:

The majority of respondents indicated that if between one-quarter [25%] and one-

third [33%] of the content in the abstract, introduction or discussion is copied

without citation, the paper is likely to be rejected. (pp. 296-297)

Masic (2012) posited that if 25% or more of an article is not original, a publisher

should take remedial action. Samuelson (1994) reported that her colleagues used a

30% acceptance rule for plagiarism-of-self. Given these variations, it is clear that more

work needs to be completed. Most important, the profession should enact a set of

commonly accepted plagiarism levels. How institutions use these levels will be a matter

of individual preferences based upon publication goals and publisher resources.

However, academia and the publishing industry need to agree on some measurement

system. While the stated plagiarism goal is often defined as zero tolerance, the practical

application could utilize a three-level plagiarism model related to the manuscript

submission outcomes of 1) accepted, 2) accepted with editing, and 3) rejected.

Professors and Instructors

While this study did not examine plagiarism issues for undergraduate students, it

provided a basis for teaching all students about the science of plagiarism detection and

prevention. The key seems to be the need to keep the plagiarism discussion in the

forefront. Bailey (2016) writes:

Why don’t many students understand plagiarism? It’s simply because they were

never taught citation. Caught between instructors who thought it was “too early”

or “too late” to teach citation, they never really learned the art and never had its

importance impressed upon them. (p. 6)

Dee and Jacob (2012) also emphasized the importance of teaching students

about plagiarism prevention and from their research concluded that:

Our results demonstrate that a short educational tutorial can sharply reduce the

prevalence of plagiarism. The costs of this intervention are quite modest,

suggesting it could be scaled easily. It involves very little instructor involvement,

requires only 15 minutes on the part of students and the tutorial itself is freely

available. Moreover, our evidence suggests that the intervention has the largest

impact on lower-ability students, which may make it even more beneficial at a

wide range of public and private institutions with less selective admissions than

the highly selective institution we study. (p. 427)

Applying the research from Bailey (2016) and Dee and Jacob (2012), I instituted

the following learning aid for use in an undergraduate class, prefacing course essay

assignments:

How do you write without creating plagiarism issues? EASY!

• Read all your sources that apply to your essay. Take notes.

• Create your title and subheadings. Create the reference section in your paper

(You will continually need to edit the list by adding to it and deleting unused

ones.)

• Now write your article in your own words with your understanding of the

materials without directly looking at your sources. (DO NOT COPY, PASTE

and EDIT)

• Next, cite any ideas you found in your sources that are not yours or more

importantly, belong to others. You do not have to cite every sentence.

• If you want to add some exact quotes, cite and include a page number.

Example 1: Mayes (####) stated that "Small quotes are just placed within the

text using double quote marks" (p. 45).

• Example 2: Mayes (####) also added that:

Large quotes, such as a whole paragraph, need to be separated out and

indented. Double quote marks are not needed. However, do not go

overboard on quotes. You want the reader to know you are the author and

not just copying the bulk of the text. (p. 46)

• If you use an image, you should list its origin and cite it.

• Note: You can cite a single reference several times.

One of the most effective ways to involve students in plagiarism prevention is,

where possible, to have them submit their writings to Turnitin for a COA report. The process

leaves a student very aware of the issues and provides ample deterrent regarding

plagiarism. This study added to what other studies have posited, that under active

supervision students and authors avoid plagiaristic activity if they understand the

nuances of plagiarism. More important, their professors and instructors must lead by

attitude and example.

Researchers and Authors

While plagiarism-of-others is an institutional matter often driven by tenure-related

publishing counts, one has to recall what Rubio (2013) reported regarding a Colombian

Supreme Court case where a professor was sentenced “to two years in prison plus

monetary and civil sanctions for plagiarizing a student’s thesis” to fully understand the

seriousness of plagiarism-of-others (p. 141).

Regarding plagiarism-of-self, this study provided evidence of substantive

plagiarism-of-self SSI in published articles that explains the overall difference in

the aDSI statistics between published articles and dissertations. However, while there is

much discourse about the implications of plagiarism-of-self issues, given what

happened to Schminke and Ambrose (2014), an author, even a very renowned author,

cannot simply dismiss plagiarism-of-self matters. Plagiarism-of-self inflates the

importance of work and can exaggerate the contribution to a science. It is important that

a researcher and author notify readers of previous works and get copyright releases

when needed. Moreover, researchers and authors need to agree on

the ethical definitions for adjudicating plagiarism-of-self issues.

Reviewers and Publishers

Publishers control the ethics of plagiarism by the collective determination of what

is acceptable for publication. However, publishers need an industry-wide defined

process for detecting plagiarism. Plagiarism detection efforts cannot be discriminatory

and must be supported by policy and procedures that are verifiably consistent and

accurate. If they use peer reviewers to search for plagiarism, the publishers need to

provide the reviewers with tools, as well as the training, and the needed support when

plagiarism is detected. Most important, any adjudication of plagiaristic manuscripts must

include input from the author(s), who must be given the

opportunity to defend themselves.

COA Software Providers

This study demonstrated that plagiarism detection software is only a tool that a

researcher can use to investigate potential evidence of plagiarism. Turnitin was

primarily engineered for examining unpublished documents that have not yet had the

chance to spread throughout the Internet. During the execution of this research, it

became apparent that, for corpus research on published documents, Turnitin requires some additional

features. Below is a list of suggested features, each with its

justification.

1. Show year of publication or Turnitin acquisition date with each listed similarity

source. It is critical to know whether the test document or the source document

was published first.

2. Display article title, authors, and publish date from internet sources as Turnitin

does with publications. This information is essential in a COA.

3. Be able to exclude headers and footers. Headers and footers are often publisher

generated across a vast number of documents thus leading to false positives that

are difficult to eliminate.

4. Be able to exclude selected text from the report. Eliminating selected text from

the report is critical for eliminating trivial and template false positives across all

sources.

5. Be able to exclude document title pages, copyright pages, abstracts, and

dedications. Common research documents and surveys, and institution and

publisher agreements generate false positives that are difficult to eliminate, i.e.,

set page ranges for content analysis at the COA report interface.

6. Be able to exclude document reference lists and appendices, i.e., set page

ranges for content analysis at the COA report interface.

7. Allow for a search of the DCD text stream for finding author name, Copyright

notice, etc. The DCD text stream is difficult to navigate, read, and interpret. The

ability to search specific words would help with a researcher's analysis.

8. Allow for copying of the text DCD stream. When material similarities occur, proof

must be documented to assist with a plagiarism resolution. Moreover, data

streams can change leaving once available evidence missing.

9. Allow for global exclusion configurations of common internet sources like

www.researchgate.com, www.coursehero.com, www.DocStop.com,

www.Proquest.com, www.gradworks.umi.com, www.thefreelibrary.com and

www.slideshare.net. These websites often publish existing documents, in full or

in part, creating large and frequent false positives.

10. Provide counts on total exclusions with total percent excluded. It is important to

know how many false positives have been removed and what percent is

attributed to each excluded source. Moreover, it will be easier for a researcher to

restore exclusions for a re-investigation.

11. Provide an option to exclude all sources (sub-sources) under a certain

percentage, including those less than 1%. Hundreds of small similarity sources

often consisting of trivial and template similarities lead to false positives and are

difficult to eliminate. A blanket exclusion would free up resources for a more

thorough investigation on the larger similarities.

12. Provide an option to exclude by ratio the number of source similarities that will not be

examined. Using an interpolation calculation, Turnitin can be set to estimate the

potential exclusions by the ratio of confirmed exclusions to the starting document

similarity index.

13. The exclusion “Restore All” button should provide the summative number and

percent of the excluded SSI that would be restored.
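
Several of the suggested features (global exclusion of common internet sources, a blanket exclusion below a percentage threshold, and counts with total percent excluded) can be sketched outside Turnitin as a simple filter. This is a hypothetical illustration: Turnitin exposes no such interface as far as this study found, and the data layout, function names, and default threshold below are assumptions.

```python
from urllib.parse import urlparse

# A few of the republishing sites named in the list above; illustrative, not exhaustive.
EXCLUDED_DOMAINS = {"coursehero.com", "gradworks.umi.com", "proquest.com", "slideshare.net"}


def host_of(source):
    """Return the lower-cased host of a URL or bare host name, without a leading 'www.'."""
    parsed = urlparse(source if "//" in source else "//" + source)
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def apply_exclusions(sources, excluded_domains=EXCLUDED_DOMAINS, min_pct=1.0):
    """Split (source, similarity percent) pairs into kept and excluded lists.

    A source is excluded when its host is, or is a subdomain of, an excluded
    domain, or when its similarity percent is below min_pct. The summary
    reports how many sources were excluded and their combined percent.
    """
    kept, excluded = [], []
    for source, pct in sources:
        host = host_of(source)
        in_domain = any(host == d or host.endswith("." + d) for d in excluded_domains)
        if in_domain or pct < min_pct:
            excluded.append((source, pct))
        else:
            kept.append((source, pct))
    summary = {
        "excluded_count": len(excluded),
        "excluded_pct": sum(pct for _, pct in excluded),
    }
    return kept, excluded, summary
```

Because excluded sources are returned rather than discarded, an investigator could restore them for a re-investigation, in the spirit of features 10 and 13. Note that summing source percentages ignores overlapping similarities, so the total is only an upper estimate.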

Future Research

While it is important to test the validity (replicability and accuracy) of this study,

researchers should also look to improve on what this study has already offered. I

encourage others to follow up on this study's techniques and the analysis processes

employed, as well as the results. Because this study only covered dissertations and

published articles, additional work is required. Covering student plagiarism and cheating

is a continual challenge. Changing technologies and attitude shifts across students and

faculty require new research and a revisiting of prior research of plagiarism and

plagiarism prevention.

Future studies can be undertaken to determine if the processes and procedures I

documented to remove false positives are applicable with other plagiarism packages.

Furthermore, while I provided a short synopsis of several plagiarism detection

packages, it is important to see how the Turnitin results from this study would compare

using other packages. These studies could compare features, results, and usefulness of

competing plagiarism detection packages.

Additional COA studies can be used to compare results across different

professions and cultures. These studies can look for variations in the processes and the

results. Moreover, COA studies on unpublished documents could be compared to

published documents noting the statistical differences in needed adjustments and

results.

Another area needing review is the calculated final adjustment. This study used a

formula (see pages 64-65) to exclude a portion of the SSI less than 1% based upon a

ratio of previously excluded SSI. A random sample of the SSI less than 1% could be

used to verify the broader application of the formula across all SSI smaller than 1%.

This study defined substantive SSI as having a similarity index of 5% or more.

Research should examine that percentage and verify the practicality of the 5%

threshold. Further research could examine the prospect that a word count threshold

would be more appropriate.

Additional validation research is needed regarding the usefulness of the set of

DSLs used in this study. My DSLs were based upon merging several existing levels

from other similar studies. However, my document similarity levels require more

research to affirm that these levels are appropriate and could serve as a valuable set of

metrics in the academic research and publishing fields. A greater understanding of what

would be considered low, high, or excessive plagiarism is important. These studies could

use respected professors and editors to review sample documents that fit within the

three levels. They could identify what level best described the aDSI from these samples.

Either a quantitative or qualitative analysis of recorded impressions could potentially

confirm or adjust the DSLs with the most appropriate aDSI ranges.

Most important, the goal of plagiarism science must be the prevention of

plagiaristic activities. Further research can examine which educational techniques are

the most effective in reducing student plagiarism using various plagiarism detection

methodologies and preemptive intervention strategies. The publishing profession must

also identify and rank the best practices for prevention strategies based upon replicable

submission results.

In closing, our profession cannot overlook the importance of recognizing the vast

amount of original work that is the norm and avoid focusing on the vindictiveness of

punishing the ignorant. Preventative measures against plagiarism will help our industry more

than punitive measures. It is important to recognize that content originality is more

important than content similarity and that education is the key.

APPENDIX A

DISSERTATION FREQUENCY DETAIL TABLES

DSL Frequency Table for Dissertations with Research Method Subpopulation (n=360)

                  Document Similarity Levels
Research Method   Low (0%-14%)      High (15%-24%)    Excessive (25%-100%)
                  Freq.    %        Freq.    %        Freq.    %
Quantitative      110      34.7%    22       62.8%    6        75.0%
Qualitative       149      47.0%    10       28.5%    1        12.5%
Other             58       18.3%    3        8.7%     1        12.5%

Descriptive Statistics for Dissertations Subpopulation (n=360)

Number of Words(a)   Freq.   Percent   aDSI   SD     aDSW    SD
10000                2       1%        .12    0      1687    152
20000                30      8%        .13    7.2    2850    1635
30000                82      23%       .10    .7     2945    2114
40000                77      21%       .09    .05    3640    2141
50000                60      17%       .09    .05    4330    2929
60000                37      10%       .05    3.8    3167    2161
70000                27      8%        .07    .05    4737    3796
80000                18      5%        .07    .06    5929    4797
90000                9       3%        .05    .03    4769    3052
100000               6       2%        .06    .05    6273    4584
110000               6       2%        .07    .05    7663    5119
120000               2       1%        .02    .03    4875    2438
130000               3       1%        .05    .02    7225    2886
140000               0       0         .00    0      0       0
150000               0       0         .00    0      0       0
160000               1       <1%       .03    0.0    4662    4662

(a) The document word counts were rounded to the nearest ten thousand.

Descriptive Statistics for Published Articles Subpopulation (n=360)

Number of References(a)   Freq.   Percent   aDSI   SD     aDSW     SD
50                        62      17%       .08    .07    2778     2420
100                       116     32%       .09    .05    3330     1927
150                       93      26%       .08    .06    3690     2512
200                       51      14%       .10    .07    4917     3265
250                       18      5%        .09    .06    5024     3228
300                       9       3%        .07    .05    4922     3660
350                       7       2%        .07    .05    5396     4439
400                       1       <1%       .11    .00    11917    0
450                       2       1%        .20    .09    17178    3977
500                       0       0         .00    .00    0        0
550                       0       0         .00    .00    0        0
600                       1       <1%       .07    .00    8872     0

(a) The document reference counts were rounded to the nearest fifty.

APPENDIX B

PUBLISHED ARTICLES FREQUENCY DETAIL TABLES

DSL Frequency Table for Published Article with Research Method Subpopulation (n=360)

                  Document Similarity Levels
Research Method   Low (0%-14%)      High (15%-24%)    Excessive (25%-100%)
                  Freq.    %        Freq.    %        Freq.    %
Quantitative      91       31.9%    17       37.0%    14       48.3%
Qualitative       71       24.9%    10       21.7%    2        6.9%
Other             123      43.2%    19       41.3%    13       44.8%

Number of Words(a)   Freq.   Percent   aDSI   SD     aDSW    SD
10000                333     92%       .12    .10    865     885
20000                27      8%        .08    .10    1337    1546

(a) The document word counts were rounded to the nearest ten thousand.

Number of References(a)   Freq.   Percent   aDSI   SD     aDSW    SD
50                        301     83%       .11    .10    767     785
100                       52      14%       .12    .11    1435    1094
150                       3       1%        .20    .13    2684    1374
200                       2       1%        .25    .32    4148    5198
250                       2       1%        .05    .03    1160    602

(a) The document reference counts were rounded to the nearest fifty.

APPENDIX C

SPSS AND R SYNTAX

Download the CSV data, SAV data, SPSS syntax, and R Syntax files used in my study.

Zip File Download: Supporting CSV and SAV Data files, SPSS Syntax and R Syntax

CSV Files

• 360 Dissertation Corpus CSV file

• 360 Published Article Corpus CSV File

• 720 Combined Corpus CSV File

• 181 Substantive SSI Details CSV File

SPSS Data Files

• 360 Dissertation Corpus SAV file

• 360 Published Article Corpus SAV File

• 720 Combined Corpus SAV File

• 181 Substantive SSI Details SAV File

SPSS Syntax Files

• SPSS CSV Import to SAV Syntax SPS File

• SPSS COA Analysis SPS File

• SPSS Check Overs SPS File

• SPSS Charts and Graphs Syntax SPS File

R-Studio Syntax Files

• R-Studio Charts and Graphs Syntax R File

APPENDIX D

CSV FILES FIELD DESCRIPTIONS

Files: CSV_CORPUS-20171222.CSV CSV_DIS-20171222.CSV CSV_ART-20171222.CSV

dcmDOCNBR SRT System Generated Unique Document identifier

dcmDOCTYP 1=Dissertation 2=Published Article

dcmYEARMO Posted Year of Publication

dcmDOCCLS Document Class (Research Method=1,2,3, or 4)

dumQUANT Quantitative = 1 else 0

dumQUALT Qualitative = 1 else 0

dumMIXED Mixed = 1 else 0

dumOTHER Other = 1 else 0

dcmATHCNT Author Count

dcmPGECNT Page Count

dcmWRDCNT Word Count

wrkWRDCNT Word Count /10,000 and Rounded

dcmREFCNT Reference Count

wrkREFCNT Reference count /50 and Rounded

dcmGRSDSI Turnitin Reported DSI BEFORE any Adjustments or Exclusions

wrkGRSWRD Words based upon the Turnitin Reported DSI

dcmEXCDSI Turnitin Reported DSI AFTER any Adjustments or Exclusions

sumOTHCNT Number of POO Substantive SSI for this document

sumOTHAMT Average % for POO Substantive SSI for this document

sumOTHWRD Average similarity words for POO Substantive SSI for this document

sumSLFCNT Number of POS Substantive SSI for this document

sumSLFAMT Average % for POS Substantive SSI for this document

sumSLFWRD Average similarity words for POS Substantive SSI for this document

dcmMAXSSI Maximum SSI % found in this document

dcmFRQSSI Frequency of SSIs larger than 1% in this document

dcmSUMSSI Sum of SSI equal to and over 1% in this document

wrkADJDSI The Adjusted DSI (VERY IMPORTANT)

wrkADJWRD The adjusted Word similarities

wrkDIFDSI Difference between the dcmGRSDSI and the wrkADJDSI

wrkCOALVL The DSL category (1 = low, 2 = high, 3 = excessive)

wrkLVLPCT Average group percent for the DSL in this Document's Level.
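Several of the fields above are derived from other fields in the same record. The following is a minimal Python sketch of those derivations, not the SPSS syntax used in the study. It assumes that the dcmDOCCLS codes 1 through 4 follow the order in which the dummy fields are listed (quantitative, qualitative, mixed, other) and that "Rounded" means rounding halves upward; both are assumptions. The function names and the example record are hypothetical.

```python
import math


def round_half_up(x: float) -> int:
    """Round a non-negative value to the nearest integer, halves upward (assumed rule)."""
    return int(math.floor(x + 0.5))


# Assumed coding of dcmDOCCLS, inferred only from the order of the dummy fields.
CLASS_TO_DUMMY = {1: "dumQUANT", 2: "dumQUALT", 3: "dumMIXED", 4: "dumOTHER"}


def derive_work_fields(row: dict) -> dict:
    """Return a copy of one corpus record with the derived fields filled in."""
    out = dict(row)
    # One dummy variable per research method class.
    for code, name in CLASS_TO_DUMMY.items():
        out[name] = 1 if row["dcmDOCCLS"] == code else 0
    # Scaled counts: words per 10,000 and references per 50, rounded.
    out["wrkWRDCNT"] = round_half_up(row["dcmWRDCNT"] / 10_000)
    out["wrkREFCNT"] = round_half_up(row["dcmREFCNT"] / 50)
    # Difference between the gross DSI and the adjusted DSI.
    out["wrkDIFDSI"] = row["dcmGRSDSI"] - row["wrkADJDSI"]
    return out


# Hypothetical record; the values are illustrative and not taken from the study data.
example = derive_work_fields(
    {"dcmDOCCLS": 2, "dcmWRDCNT": 43_500, "dcmREFCNT": 130,
     "dcmGRSDSI": 21, "wrkADJDSI": 9}
)
print(example["dumQUALT"], example["wrkWRDCNT"], example["wrkREFCNT"], example["wrkDIFDSI"])
```

The explicit rounding helper is used because Python's built-in round() rounds halves to the nearest even integer, which may differ from the rounding applied in the original files.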

CSV_SSI-20170327.CSV

ssiDOCNBR SRT System Generated Unique Document identifier

ssiDOCTYP 1=Dissertation 2=Published Article

ssiDTLNBR SRT System Generated Detail Line Identifier when used with ssiDOCNBR

ssiSSITYP POO = Plagiarism-of-other POS = Plagiarism-of-self

ssiSSIAMT Substantive SSI Percentage of Similarity

ssiSSIWRD Substantive SSI Words that are Similar
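The per-document summary fields in the corpus files (sumOTHCNT, sumOTHAMT, sumOTHWRD, sumSLFCNT, sumSLFAMT, sumSLFWRD) are described as counts and averages of the substantive SSI detail lines. A minimal Python sketch of that aggregation follows, assuming the SSI file carries a header row with the field names listed above. The sample rows, the function name, and the choice to omit summary fields for documents with no detail lines of a given type are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the CSV_SSI layout; the values are illustrative only.
SAMPLE = """ssiDOCNBR,ssiDOCTYP,ssiDTLNBR,ssiSSITYP,ssiSSIAMT,ssiSSIWRD
101,1,1,POO,4,1200
101,1,2,POO,2,600
101,1,3,POS,6,1800
205,2,1,POS,3,240
"""

# POO lines feed the sumOTH* fields, POS lines feed the sumSLF* fields.
PREFIX = {"POO": "sumOTH", "POS": "sumSLF"}


def summarize_ssi(csv_text: str) -> dict:
    """Group SSI detail lines by document and type, then count and average them."""
    groups = defaultdict(list)
    for r in csv.DictReader(io.StringIO(csv_text)):
        key = (int(r["ssiDOCNBR"]), r["ssiSSITYP"])
        groups[key].append((float(r["ssiSSIAMT"]), float(r["ssiSSIWRD"])))

    summary = defaultdict(dict)
    for (doc, typ), items in groups.items():
        n = len(items)
        p = PREFIX[typ]
        summary[doc][p + "CNT"] = n                                # number of detail lines
        summary[doc][p + "AMT"] = sum(a for a, _ in items) / n     # average similarity %
        summary[doc][p + "WRD"] = sum(w for _, w in items) / n     # average similar words
    return dict(summary)


summary = summarize_ssi(SAMPLE)
print(summary[101])
```

The result can be joined to the corpus records on the document number (ssiDOCNBR to dcmDOCNBR).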

APPENDIX E

TURNITIN COA REPORT (NO ADJUSTMENTS)

REFERENCES

Adhikari, N. (2010). Avoiding plagiarism and self-plagiarism. Journal of Nepal Pediatrics

Society, 30, 77-78.

Allen, G. N., Ball, N. L., & Smith, H. J. (2011). Information systems research behaviors:

What are the normative standards? MIS Quarterly, 35(3), 533-A26.

Allison, P. (2013). What’s the best R-squared for logistic regression? Statistical

Horizons. Retrieved from http://statisticalhorizons.com/r2logistic

Amos, K. (2014). The ethics of scholarly publishing: Exploring differences in plagiarism

and duplicate publication across nations. Journal of the Medical Library

Association, 102(2), 87-91.

Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic

regression models. Biometrika, 71(1), 1-10.

APA. (2010). Publication manual of the American Psychological Association (6th ed.).

American Psychological Association. Washington, DC: Author.

Auer, N. J., & Krupar, E. M. (2001). Mouse click plagiarism: The role of technology in

plagiarism and the librarian’s role in combating it. Library Trends, 49, 415-432.

Bailey, J. (2016). Why do students not understand plagiarism? Plagiarism Today.

Retrieved from https://www.plagiarismtoday.com/2016/09/01/why-dont-students-

understand-

plagiarism/?utm_source=Plagiarism+Today+Newsletter&utm_campaign=4bacb6f

aa7-RSS_EMAIL_CAMPAIGN&utm_medium=email&utm_term=0_643f84ace3-

4bacb6faa7-412694437

Baker, M. (2015). Smart software spots statistical errors in psychology papers.

Nature.com. Retrieved from http://www.nature.com/news/smart-software-spots-

statistical-errors-in-psychology-papers-1.18657

Barrett, R., & Malcolm, J. (2005). Embedding plagiarism education in the assessment

process. International Journal of Educational Integrity, 2(1), 38–45.

Batane, T. (2010). Turning to Turnitin to fight plagiarism among university students.

Educational Technology & Society, 13(2), 1-12.

Bertolucci, J. (2013). Big data analytics: Descriptive vs. predictive vs. prescriptive.

Information Week. Retrieved from http://www.informationweek.com/big-data/big-

data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-

id/1113279?piddl_msgorder=thrd#msgs

Bedeian, A. G. (2014). "More than meets the eye": A guide to interpreting the

descriptive statistics and correlation matrices reported in management research.

Academy of Management Learning & Education, 13(1), 121-135.

doi:10.5465/amle.2013.0001

Bennett, K. K., Behrendt, L. S., & Boothby, J. L. (2011). Instructor perceptions of

plagiarism: Are we finding common ground? Teaching of Psychology, 38, 29-35.

Berry, W. D., & Feldman, S. (1985). Multiple regression in practice. SAGE university

paper series on quantitative applications in the social sciences, 07-050. Beverly

Hills, CA. SAGE.

Bettencourt, L. A., & Houston, M. B. (2001). The impact of article method type and

subject area on article citations and reference diversity in JM, JMR, and JCR.

Marketing Letters, 12(4), 327-340.

Bewick, V., Cheek, L., & Ball, J. (2005). Statistics review 14: Logistic regression. Critical

Care, 9(1). Retrieved from

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1065119/pdf/cc3045.pdf

Bham, G. H., Javvadi, B. S., & Manepalli, U. R. (2012). Multinomial logistic regression

model for single-vehicle and multivehicle collisions on urban U.S. highways in

Arkansas. Journal of Transportation Engineering, 138(6), 786-797.

doi:10.1061/(ASCE)TE.1943-5436.0000370

Braumoeller, B. F., & Gaines, B. J. (2001). Actions do speak louder than words:

Deterring plagiarism with the use of plagiarism detection software. Political

Science & Politics, 34(4), 835–839.

Bretag, T., & Mahmud, S. (2009b). A model for determining student plagiarism:

Electronic detection and academic judgment. Journal of University Teaching &

Learning Practice, 6(1).

Broome, M., Dougherty, M. C., Freda, M. C., Kearney, M. H., & Baggs, J. G. (2010).

Ethical concerns of nursing reviewers: An international survey. Nursing Ethics,

17, 741-748. doi:10.1177/0969733010379177

Burns, R. P. & Burns, R. (2009). Business Research Methods and Statistics Using

SPSS. Los Angeles, CA: Sage Publications Ltd.

Cabral-Cardoso, C. (2004). Ethical misconduct in the business school: A case of

plagiarism that turned bitter. Journal of Business Ethics, 49(1), 75-89.

doi:10.1023/B:BUSI.0000013864.76547.d5

Callahan, J. L. (2014). Creation of a moral panic? Self-plagiarism in the Academy.

Human Resource Development Review, 13(1), 3–10.

doi:10.1177/1534484313519063

Chao, C., Wilhelm, W. J., & Neureuther, B. D. (2009). A study of electronic detection

and pedagogical approaches for reducing plagiarism. Delta Pi Epsilon Journal,

51, 31-42.

Cheema, Z., Mahmood, S., Mahmood, A., & Shah, M. (2011). Conceptual awareness of

research scholars about plagiarism at higher education level: Intellectual property

right and patent. International Journal of Academic Research, 3, 666-671.

Cheung, M., & Driver, D. (2004). Self-plagiarism as a social work concern. Hong Kong

Journal of Social Work, 38, 3-13.

Chronicle of Higher Education. (2002). Corruption plagues academe around the world.

Chronicle of Higher Education. Retrieved from

http://chronicle.com/article/Corruption-Plagues-Academe/16544

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,

NJ: Erlbaum.

COPE. (2014). Promoting integrity in research publication. Committee on Publication

Ethics. Retrieved from http://publicationethics.org/cases

Cramer, D. (1998). Fundamental statistics for social research. London: Routledge.

Cramer, D., & Howitt, D. (2004). The SAGE dictionary of statistics. London: SAGE.

Creswell, J. W. (2014). Research design (4th ed.). Los Angeles, CA: Sage Publications.

Cromwell, D. (2012). Punishing the pen with the sword?: Colombia's new, extreme, and

ineffective punishment for plagiarism. Pacific Rim Law & Policy Journal

Association.

David, W. (2011). A flap in Germany risks a storm in Spain. The Times, (United

Kingdom), 25.

Decullier, E., Huot, L., Samson, G., & Maisonneuve, H. (2013). Visibility of retractions: A

cross-sectional one-year study. BMC Research Notes, 6, 238.

doi:10.1186/1756-0500-6-238

Dee, T. S., & Jacob, B. A. (2012). Rational ignorance in education: A field experiment in

student plagiarism. Journal of Human Resources, 47, 397-434.

DeGeeter, M., Harris, K., Kehr, H., Ford, C., Lane, D., Nuzum, D., & Gibson, W. (2014).

Pharmacy students' ability to identify plagiarism after an educational intervention.

American Journal of Pharmaceutical Education, 78(2), 1-6.

EBSCO. (2016a). How is relevance ranking determined in EBSCO Discovery Service

(EDS)? EBSCO Support. Retrieved from

http://support.epnet.com/knowledge_base/detail.php?id=3971

EBSCO. (2016b). Which fields can be searched using the "Select a Field" drop-down

list? EBSCO Help. Retrieved from

https://help.ebsco.com/interfaces/EBSCO_Guides/General_Product_FAQs/fields

_searched_using_Select_a_Field_drop_down_list

Elbeck, M. (2009). Exploring the murky waters of self-plagiarism. Journal for

Advancement of Marketing Education, 14, 41-51.

ERIC. (2017). Online Submission Frequently Asked Questions. US Department of

Education. Retrieved from: https://eric.ed.gov/?submitfaq

Fahy, P. J. (2013). Uses of published research: An exploratory case study. International

Review of Research in Open and Distance Learning, 14(1), 145-166.

Fang, F. C., Steen, R., & Casadevall, A. (2012). Misconduct accounts for the majority of

retracted scientific publications. PNAS Proceedings of the National Academy of

Sciences of the United States of America, 109, 17028-17033.

doi:10.1073/pnas.1212247109

Farthing, M. G. (2006). Authors and publication practices. Science & Engineering

Ethics, 12, 41-52.

Field, A. (2011). Discovering Statistics using SPSS. Thousand Oaks, CA, US: SAGE

Publications Inc.

Fournier, G. (2016). Replicability. Psych Central. Retrieved from

https://psychcentral.com/encyclopedia/replicability/

Gibelman, M., & Gelman, S. R. (2003). Plagiarism in academia: Trends and

implications. Accountability in Research: Policies & Quality Assurance, 10(4),

229-252.

Gipp, B., Meuschke, N., & Breitinger, C. (2014). Citation-based plagiarism detection:

Practicability on a large-scale scientific corpus. Journal of The Association For

Information Science & Technology, 65(8), 1527-1540.

Grace-Martin, K. (2016). Outliers: To drop or not to drop. The Analysis Factor. Retrieved

from http://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/

Grace-Martin, K. (2017). Explaining Logistic Regression Results to Non-Statistical

Audiences. The Analysis Factor. Retrieved from

http://www.theanalysisfactor.com/explaining-logistic-regression/

Grammarly. (2017). Plagiarism Detection. PremiumGrammary. Retrieved from:

https://app.grammarly.com/

Greenbird. (2009). The Urban Dictionary [blog post] Retrieved from

http://www.urbandictionary.com/define.php?term=Reverse+Plagiarism

Halupa, C. M. & Bolliger, D. U. (2013). Faculty perceptions of student self-plagiarism:

An exploratory multi-university study. Journal of Academic Ethics, 11, 297-310.

doi:10.1007/s10805-013-9195-6

Halupa, C. M. (2014). Exploring student self-plagiarism. International Journal of Higher

Education, 3(1), 121-126.

Handa, S. (2008, July). Plagiarism and publication ethics: Dos and don'ts. Indian

Journal of Dermatology, Venereology & Leprology. pp. 301-303.

Hatch, T., & Skipper, A. (2016). How much are PhD students publishing before

graduation?: An examination of four social science disciplines. Journal of

Scholarly Publishing, 47(2), 171-179.

Heather, J. (2010). Turnitin: Identifying and fixing a hole in current plagiarism detection

software. Assessment & Evaluation in Higher Education, 35(6), 1-15.

doi:10.1080/02602938.2010.486471

Heckler, N. C., Rice, M., & Bryan, C. (2013). Turnitin systems: A deterrent to plagiarism

in college classrooms. Journal of Research on Technology In Education

(International Society For Technology In Education), 45(3), 229-248.

Hege, A. C. (2008). The effect of affective state on inadvertent plagiarism. University of

Virginia, ProQuest Dissertations Publishing, 3312170.

Hendee, W. R. (2007). A concern about plagiarism. Journal of Medical Physics, 32(4),

143-144.

Higher Education Commission, Pakistan. (n.d.). How to interpret originality report

(Guidelines). Retrieved from

http://www.hec.gov.pk/InsideHEC/Divisions/QALI/QADivision/Documents/Guideli

nes%20on%20Turnitin.pdf

Hill, J., & Page, E. (2009). An empirical research study of the efficacy of two plagiarism-

detection applications. Journal of Web Librarianship, 3(3), 169-181.

Hodges, J., & Lehmann, E. (1956). The efficiency of some nonparametric competitors of

the t test. Annals of Mathematical Statistics, 27, 324-335.

Honig, B., & Bedi, A. (2012). The fox in the hen house: A critical examination of

plagiarism among members of the Academy of Management. Academy Of

Management Learning & Education, 11(1), 101-123.

doi:10.5465/amle.2010.0084

Horner, J., & Minifie, F. D. (2011). Research ethics III: Publication practices and

authorship, conflicts of interest, and research misconduct. Journal of Speech,

Language, and Hearing Research, 54, S346-S362.

Howell, D.C. (2010). Statistical methods for psychology. Belmont, CA, US: Cengage

Wadsworth.

Hulten, Nicholls, Winslet, & Kmiot. (2000). The Committee on Publication Ethics

(COPE) Guidelines. Colorectal Disease, 2, 247-248.

Humphreys, L., & Klein, A. (2006). Measuring online social support: Can

computer-based text analysis approximate Burleson’s Person-Centered Hierarchy?

Conference Papers -- International Communication Association, 1-29.

IDRE-UCLA. (2017A). Annotated SPSS output multinomial logistic regression. Institute

for Digital Research and Education. Retrieved from

http://www.ats.ucla.edu/stat/spss/output/mlogit.htm

IDRE-UCLA. (2017B). FAQ: What are pseudo R-squareds? Institute for Digital

Research and Education. Retrieved from

http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm

iParadigms. (2011). Turnitin Instructor User Manual Chapter 2: Originality Check.

Turnitin Instructor Handbook, pp.48-60. Retrieved from

https://turnitin.com/static/resources/documentation/turnitin/training/Instructor_Ori

ginality_Report_Chapter_2.pdf

Ison, D. I. (2012). Plagiarism among dissertations: Prevalence at online institutions.

Journal of Academic Ethics, 10(3), 227-236. doi:10.1007/s10805-012-9165-4

Ison, D. I. (2014). Does the online environment promote plagiarism? A comparative

study of dissertations from brick-and-mortar versus online institutions. Journal of

Online Learning & Teaching, 10(2), 272-281.

iThenticate. (n.d.). 6 Consequences of Plagiarism. iThenticate. Retrieved from

http://www.ithenticate.com/resources/6-consequences-of-plagiarism

James, R., McInnis, C., & Devlin, M. (2002). Plagiarism detection software: How effective is

it? Assessing Learning in Australian Universities. Retrieved from

http://www.cshe.unimelb.edu.au/assessinglearning/docs/PlagSoftware.pdf

Jent, H. C. (1967). Inverse plagiarism. Education Digest, 33(1), 32-34.

Jocoy, C., & DiBiase, D. (2006). Plagiarism by Adult Learners Online: A Case Study in

Detection and Remediation. International Review of Research In Open And

Distance Learning, 7(1), 1-15.

Jones, O. K. (2008). Practical issues for academics using the Turnitin plagiarism

detection software. Paper presented at the International Conference on

Computer Systems and Technologies (CompSysTech '08), Bulgaria. Retrieved

from http://dl.acm.org/citation.cfm?id=1500935

Karabag, S., & Berggren, C. (2012). Retraction, dishonesty and plagiarism: Analysis of

a crucial issue for academic publishing, and the inadequate responses from

leading journals in economics and management disciplines. Journal of Applied

Economics & Business Research, 2, 172-183.

Karlsson, J., & Beaufils, P. (2013). Legitimate division of large data sets, salami slicing

and dual publication: Where does a fraud begin? Knee Surgery, Sports

Traumatology, Arthroscopy, 21, 751-752.

Keck, C. (2006). The use of paraphrase in summary writing: A comparison of L1 and L2

writers. Journal of Second Language Writing, 15, 261–278.

Klienfield, Z. (2014). Web tool Turnitin could help UC Berkeley combat plagiarism. The

Daily Californian. Retrieved from http://www.dailycal.org/2014/02/18/campus-

discusses-possible-introduction-web-tool-turnitin-help-bolster-honor-code/

Knight, M. (2013). Editing as reverse plagiarism. Monty Knight Counseling. Retrieved

from http://www.drmontyknightcounseling.com/2013/08/22/editing-as-reverse-

plagiarism/

Kock, N. (1999). A case of academic plagiarism. Communications of the ACM, 42, 96-

Krejcie, R. V., & Morgan, D. W. (1970). Determining sample size for research

activities. Educational & Psychological Measurement, 30, 607-610.

Lee, K., Ahn, H., Moon, H., Kodell, R. L., & Chen, J. J. (2013). Multinomial logistic

regression ensembles. Journal of Biopharmaceutical Statistics, 23(3), 681-694.

doi:10.1080/10543406.2012.756500

Ling, S. (2006). Cultural backgrounds and textual appropriation. Language Awareness,

15(4), 264-282. doi:10.2167/la406.0

Liu, D. (2005). Plagiarism in ESOL students: Is cultural conditioning truly the major

culprit? ELT Journal: English Language Teachers Journal, 59(3), 234-241.

doi:10.1093/elt/cci043

Marcus, S. & Beck, S. (2011). Faculty perceptions of plagiarism at Queensborough

Community College. Community & Junior College Libraries, 17, 63-73.

Marshall, T., Taylor, B., Hothersall, E., & Pérez-Martín, L. (2011). Plagiarism: A case

study of quality improvement in a taught postgraduate programme. Medical

Teacher, 33(7), e375-e381. doi:10.3109/0142159X.2011.579201

Martin, D., Rao, A., & Sloan, L. (2011). Ethnicity, acculturation, and plagiarism: A

criterion study of unethical academic conduct. Human Organization.

doi:10.17730/humo.70.1.nl775v2u633678k6. Retrieved from

http://sfaajournals.net/doi/10.17730/humo.70.1.nl775v2u633678k6

Masic, I. (2012). Ethical aspects and dilemmas of preparing, writing and publishing of

the scientific papers in the biomedical journals. Acta Informatica Medica, 20, 141-

148. doi:10.5455/aim.2012.20.141-148

Massart, D. L., Smeyers-Verbeke, J., Capron, X., & Schlesier, K. (2005). Visual

presentation of data by means of box plots. LC-GC Europe, 18(4), 215–218.

Mawdsley, R. D. (2009). The tangled web of plagiarism litigation: Sorting out the legal

issues. B.Y.U. Education and Law Journal. 2009(2), 245-267. Retrieved from

http://digitalcommons.law.byu.edu/cgi/viewcontent.cgi?article=1259&context=elj

Mayes, R. (2016). Scholarly research tracker [computer software]. Available from

http://www.premierdatasoftware.com/srt.htm

Mayes, R. (2016). RandomSampler [computer software]. Available from

http://www.robinjamesmayes.com/projects.html

McMurtry, K. (2001). E-cheating: Combating a 21st century challenge. T.H.E. Journal,

29(4), 36-38, 40-41.

Moore, D.S. & McCabe, G. P. (1989). Introduction to the practice of statistics. New

York, NY, US: W H Freeman/Times Books/ Henry Holt & Co.

Moore. (2008, April 2). Star terminates local columnist for plagiarism. Ventura County

Star (CA).

Moten, A. R. (2014). Academic dishonesty and misconduct: Curbing plagiarism in the

Muslim world. Intellectual Discourse, 22(2), 167-189.

MSU. (n.d.). Plagiarism. Michigan State University: Office of the University

Ombudsperson. Retrieved from https://www.msu.edu/~ombud/academic-

integrity/plagiarism-policy.html

Mulcahy, S., & Goodacre, C. (2004). Opening Pandora’s box of academic integrity:

Using plagiarism detection software. Proceedings from ASCILITE Conference

2004, Perth, WA.

Mullen, M. R., Milne, G. R., & Doney, P. M. (1995). An International marketing

application of outlier analysis for structural equations: A methodological note.

Journal of International Marketing, 3(1), 45-62.

Nefzger, M. D., & Drasgow, J. (1957). The needless assumption of normality in

Pearson's r. American Psychologist, 12(10), 623-625. doi:10.1037/h0048216

Neville, C.W. & Wadler, P. (2005). Beware the consequences of citing self-plagiarism.

Communications of the ACM, 48(6), 13.

O’Connell, A. A. & Rivet Amico, K. (2010). Logistic Regression. In G. R. Hancock &

R. O. Mueller (Eds.), The Reviewer’s Guide to Quantitative Methods in the Social

Sciences. Routledge, New York, NY.

O'Connor, S. J. (2010, May). What do duplicate publications, self-plagiarism and the

monotony of endless descriptive studies signify: Publication pressures or simply

a collective lack of imagination? European Journal of Cancer Care. pp. 281-283.

doi:10.1111/j.1365-2354.2010.01192.x

Office of Inspector General. (2014). National Science Foundation Office of Inspector

General semi-annual report to Congress. Retrieved from

http://nsf.gov/pubs/2014/oig14002/oig14002.pdf

Olive, M. L., & Franco, J. H. (2008). (Effect) size matters: And so does the calculation.

The Behavior Analyst Today, 9(1), 5-10. doi:10.1037/h0100642

Olt, M. R. (2007). A new design on plagiarism: Developing an instructional design model

to deter plagiarism in online courses. Capella University, ProQuest Dissertations

Publishing, 3277651.

Onwuegbuzie, A. J. & Daniel, L. G. (2005). Editorial: Evidence-based guidelines for

publishing articles in research in the schools and beyond. Research in the

Schools, 12(2), 1-11.

Onwuegbuzie, A. J., & Daniel, L. G. (2003). Typology of analytical and interpretational

errors in quantitative and qualitative educational research. Current Issues in

Education, 6(2). Retrieved from https://cie.ed.asu.edu/volume6/number2/.

Orim, S. I., Davies, J. W., Borg, E., & Glendinning, I. (2013). Exploring Nigerian

postgraduate students' experience of plagiarism: A phenomenographic case

study. International Journal for Educational Integrity, 9, 20-34.

Perfect, T. H., Defeldre A., Elliman, R. & Dehon, H. (2011). No evidence of age-related

increases in unconscious plagiarism during free recall. Memory, 19(5), 514-528.

Petrucci, C. J. (2009). A primer for social worker researchers on how to conduct a

multinomial logistic regression. Journal of Social Service Research, 35(2), 193-

205. doi:10.1080/01488370802678983

Posner, R. A. (2007). The little book of plagiarism. New York, NY: Pantheon Books.

PT. (2011). PlagAware takes top honors in plagiarism checker showdown. Plagiarism

Today. Retrieved from https://www.plagiarismtoday.com/2011/01/13/plagaware-

takes-top-honors-in-plagiarism-checker-showdown/

Randles, R. H., & Wolfe, D. A. (1979). Introduction to the theory of nonparametric

statistics. New York: Wiley.

Regmi, K., & Naidoo, J. (2013). Understanding the processes of writing papers

reflectively. Nurse Researcher, 20(6), 33-39.

Retraction Watch. (2014). SAGE Publications busts “peer review and citation ring,” 60

papers retracted. Retraction Watch. Retrieved from

http://retractionwatch.com/2014/07/08/sage-publications-busts-peer-review-and-

citation-ring-60-papers-retracted/

Rhodes, M., Gelman, S. A., & Brickman, D. (2008). Developmental changes in the

consideration of sample diversity in inductive reasoning. Journal of Cognition &

Development, 9(1), 112-143. doi:10.1080/15248370701836626

Robinson, S. R. (2014). Self-plagiarism and unfortunate publication: An essay on

academic values. Studies in Higher Education, 39(2), 265-277.

Roig, M. (2010). Plagiarism and self-plagiarism: What every author should know.

Biochemia Medica, 20, 295-300.

Rosenthal, R. (1991). Meta-analytic procedures for social research (2nd ed.). Newbury

Park, CA: Sage.

Rosnow, R. L., & Rosenthal, R. (2005). Beginning behavioral research: A conceptual

primer (5th ed.). Englewood Cliffs, NJ: Pearson/Prentice Hall.

Rouse, M. (2014). Document metadata. Content Management. Retrieved from

http://whatis.techtarget.com/definition/document-metadata

Rubio, C. C. (2013). Colombia’s poetic world of authors' moral rights: Considerations on

imprisoning a professor for plagiarism. Pacific Rim Law & Policy Journal, 22(1),

141-155.

Samuelson, P. (1994). Self-plagiarism or fair use. Communications of the ACM, 37(8),

21-25.

Schminke, M., & Ambrose, M. L. (2014). Retraction statement for 'Ethics and integrity of

the publishing process: Myths, facts, and a roadmap'. Management and

Organization Review, 10(1), 157–162.

Schwab, J. A. (2002). Multinomial logistic regression: Basic relationships and complete

problems. http://www.utexas.edu/courses/schwab/sw388r7/SolvingProblems/

Shanmugam, A. (2009). Citation practices amongst trainee teachers as reflected in their

project papers. Malaysian Journal of Library & Information Science, 14(2), 1-16.

Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete

samples). Biometrika, 52(3/4), 591-611.

Shaw, V. N. (2002). Counseling the university professor on the securing of research

grants and the publishing of research products. Education, 123, 395.

Shelley, C. (2005). "Stolen words": A brief history and analysis of preaching and

plagiarism. Encounter, 66, 301-316.

Shi, L. (2006). Cultural backgrounds and textual appropriation. Language Awareness,

15(4), 264-282. doi:10.2167/la406.0

Siaputra, I. (2013). The 4PA of plagiarism: A psycho-academic profile of plagiarists.

International Journal for Educational Integrity, 9, 50-59.

Sibbald, B. (2000). Guest editorial: COPE guidelines on good publication practice: An

author’s view. Health and Social Care in the Community, 8(6), 355–361.

Siegel, E. (2016, December 7). Analytics + business expertise = actionable predictions for each

customer. Prediction Impact. Retrieved from

http://www.predictionimpact.com/customer-prediction.html

Simon, M. K. (2011). Dissertation and scholarly research: Recipes for success (2011

ed.). Seattle, WA, Dissertation Success, LLC.

Spielmans, G. I., Biehn, T. L., & Sawrey, D. L. (2010). A case study of salami slicing:

Pooled analyses of duloxetine for depression. Psychotherapy and

Psychosomatics, 79(2), 97-106. doi:10.1159/000270917

Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New

York, NY: Routledge.

Sun, Y. (2013). Do journal authors plagiarize? Using plagiarism detection software to

uncover matching text across disciplines. Journal of English for Academic

Purposes, 12(4), 264-272. doi:10.1016/j.jeap.2013.07.002

Susser, M. & Yankauer, A. P. (1993). Prior duplicate, repetitive, fragmented, and

redundant publication and editorial decisions. American Journal of Public Health,

83, 792–794.

Sutherland-Smith, W. & Carr, R. (2005). Turnitin.com: Teachers’ perspectives of

anti-plagiarism software in raising issues of educational integrity. Journal of University

Teaching and Learning Practice, 2(3b), 94–101.

Swanson, R. A. (1995). Human resource development: Performance is the key. Human

Resource Development Quarterly, 6(2), 207–13.

Technavio, R. (2016). Top 8 vendors in the anti-plagiarism software market for

education from 2016 to 2020: Technavio. Business Wire (English). Retrieved

from http://www.businesswire.com/news/home/20161011005155/en/Top-8-

Vendors-Anti-Plagiarism-Software-Market-Education

Thomas, A., & de Bruin, G. P. (2014). Plagiarism in South African management

journals. South African Journal of Science, 111(1/2), 1-3.

doi:10.17159/sajs.2015/20140017

Thumwimon, S., & Takahashi, Y. (2010). A prospective process for implementing

human resource development (HRD) for corporate social responsibility (CSR).

Interdisciplinary Journal of Contemporary Research in Business, 2(1), 10-32.

Toulouse Graduate School. (2016). Preparation and filing of dissertations and theses

(Spring 2016 Revision). University of North Texas. Retrieved from

http://tsgs.unt.edu/thesis-manual

Tucci, V., & Galwankar, S. (2011). JETS policy on plagiarism and academic dishonesty.

Journal of Emergencies, Trauma & Shock, 4, 3-6. doi:10.4103/0974-2700.76818

Turmfalke. (2010). The Urban Dictionary. Retrieved from

http://www.urbandictionary.com/define.php?term=Reverse+Plagiarism

Turnitin. (2011). Plagiarism and the web: Myths and realities: White paper. iParadigms.

Retrieved from http://pages.turnitin.com/PlagiarismandtheWebHE.html

UMUC. (2016). Reading and Understanding Turnitin Originality Reports. Retrieved from

http://www.umuc.edu/library/libhow/turnitinoriginality_tutorial.cfm

U.S. Government. (2014). Inspector General Act of 1978, as amended. Retrieved from

http://www.ignet.gov/pande/leg/igactasof1010.pdf

Wager, E. & Williams, P. (2011). Why and how do journals retract articles? An analysis

of Medline retractions 1988-2008. Journal of Medical Ethics, 37, 567-570.

doi:10.1136/jme.2010.040964

Walker, J. (2010). Measuring plagiarism: researching what students do, not what they

say they do. Studies in Higher Education, 35(1), 41–59.

Weisgrau, S. L. (2011). Talking about brevity with high school seniors. Vocabula

Review, 13(7), 1-4.

White, R.T. & Arzi, H.J. (2005). Longitudinal studies: Designs, validity, practicality, and

value. Research in Science Education, 35, 137-149.

Willis, M. (2016). Why do peer reviewers decline to review manuscripts? A study of

reviewer invitation responses. Learned Publishing. doi:10.1002/leap.1006

Wilmington, T. (2013, November 21). Acrobat Reader Discussions, Adobe

Communities. Retrieved from

https://forums.adobe.com/thread/1110154?start=0&tstart=0

Wittmaack, K. (2005). Penalties plus high-quality review to fight plagiarism. Nature, 436,

24. doi:10.1038/436024d

Yank, V. & Barnes, D. (2003). Consensus and contention regarding redundant

publications in clinical research: Cross-sectional survey of editors and authors.

Journal of Medical Ethics, 29, 109–114.

Yentis, S. M. (2010). Another kind of ethics: From corrections to retractions editorial.

Anaesthesia. pp. 1163-1166. doi:10.1111/j.1365-2044.2010.06557.x

Youmans, R. J. (2011). Does the adoption of plagiarism-detection software in higher

education reduce plagiarism? Studies in Higher Education, 36, 749-761.

doi:10.1080/03075079.2010.523457

Zhang, Y., & Jia, X. (2012). A survey on the use of CrossCheck for detecting plagiarism

in journal articles. Learned Publishing, 25, 292-307. doi:10.1087/20120408

Zimmerman, D. W. (1996). An efficient alternative to the Wilcoxon signed-ranks test for

paired nonnormal data. Journal of General Psychology, 123(1), 29.
