Plagiarism Workshop Mike Joy University of Bath, 29 February 2012


Emergency exits

Fire alarm

Toilets

Certificates of attendance

2

Administrative Issues

1.30 Introduction

1.50 What is plagiarism?

2.00 Our experiences

2.20 Text plagiarism

2.30 Computing and mathematics

2.50 Why do students plagiarise?

3.00 How do students plagiarise?

3.15 Break

3.30 Detection strategies and tools

3.45 Prevention strategies and university process

4.00 Discussion and conclusion

4.15 End

3

Timetable

What is Plagiarism?

“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft” (OED Online, 2012)

“To commit literary theft; to present as new and original an idea or product derived from an existing source” (Merriam-Webster Online, 2012)

These definitions are open to interpretation.

What about equations, computer programs, etc.?

“Academic integrity”

5

Definitions

Not all cheating is plagiarism.

For example, taking crib-sheets into an exam.

What about “contract cheating”, where a student pays someone else to write an assignment for them?

We adopt a broad interpretation of “plagiarism” (otherwise we may miss important types of cheating which are appropriate for us to cover here).

6

Plagiarism vs. Cheating

Cheating is potentially illegal.

Not fair on the other students.

Compromises the learning process.

Wastes time

— Staff time!

— Paperwork, disciplinary process

We are required to deal with it!

— QAA Quality Code (B6)

7

Why is this Important?

“If you go to the bar at lunchtime you can buy a solution to any of our programming assignments. I reckon the incidence of plagiarism is over 50%” (source wishes to remain anonymous, dated 1999).

Around 5% in programming assignments at Warwick University (from detailed analyses of first year programming assignments over several years, from 2002-2004).

Documented cases (90 UK HEIs, all subjects) – 0.72% (source: AMBeR Project Report 2008).

8

How big a Problem is Plagiarism?

Detection is fun.

Algorithms can be applied to the detection process (so Computer Scientists can apply their skills).

Getting involved gives us insights into how students are conducting their studies.

9

Why is this Interesting?

Our experiences

Rainbow Lorikeet, by René Modery, 2006

Basic Theory

Foundations of the Louvre, photo by Ceronne, 2006

Students must know and understand (clear University policy).

Detection must happen (the more the better!).

Due process (punishment).

Thus … four stages:

Collection → Detection → Confirmation → Investigation

(Culwin and Lancaster, 2002).

12

Four Stages

Get all documents together online:

– So they can be processed;

– Document formats need to be considered;

– Security is an issue.

Coursemaster (Nottingham)

BOSS (Warwick)

Managed Learning Environment (Blackboard, Moodle)

13

Stage 1: Collection

(1) Compare with other submissions (“intra-corpal”)

(2) Compare with external documents (“extra-corpal”)

– essay-based assignments, can use Turnitin

– program code and equations may be a problem

(1) is (relatively) easy (can even be done by hand), but

(2) is a big problem.
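Why intra-corpal comparison is the easy case can be seen in a minimal sketch: every submission is compared with every other one, which needs only n(n-1)/2 comparisons over documents already collected. The class and method names below are hypothetical, and word-set Jaccard similarity is a deliberately crude stand-in for a real similarity measure, not the method of any tool named in these slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical all-pairs check over one set of submissions. */
final class IntraCorpalCheck {
    /** Jaccard similarity of the two texts' lower-cased word sets (crude on purpose). */
    static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        if (union.isEmpty()) return 0.0;
        wordsA.retainAll(wordsB); // wordsA now holds the intersection
        return (double) wordsA.size() / union.size();
    }

    private static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    /** Compares each pair once and reports the pairs at or above the threshold. */
    static List<String> flaggedPairs(Map<String, String> submissions, double threshold) {
        List<String> ids = new ArrayList<>(submissions.keySet());
        Collections.sort(ids); // stable, readable ordering of the report
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                double s = jaccard(submissions.get(ids.get(i)), submissions.get(ids.get(j)));
                if (s >= threshold) flagged.add(ids.get(i) + " / " + ids.get(j));
            }
        }
        return flagged;
    }
}
```

A flagged pair is only a prompt for the confirmation stage; the threshold is an arbitrary assumption and would need tuning per assignment.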

14

Stage 2: Detection

Software tool says “A and B similar”.

Are they?

Never rely on a computer program!

Requires expert human judgement.

Evidence must be compelling.

Might go to court.

15

Stage 3: Confirmation

A from B, or B from A, or joint work?

If A from B, did B know?

– Open networked file?

– Printer output?

Did the culprit/s understand?

University processes must be followed:

– No shortcuts!

16

Stage 4: Investigation

Text Plagiarism

“Portrait of a Scribe” by Bartolomeo Passerotti (1529-1592)

Essay time …

Funded mainly by subscriptions from institutions.

Cache of:

– the Internet

– all documents submitted to it

– anything else it can find!

Compares text of documents submitted to it using a string-matching algorithm.

19

Turnitin® UK

Can be used by academics to

– detect plagiarism

– provide evidence

Can be used by students to

– check their own work

20

Pedagogy

21

Turnitin (1)

22

Turnitin (2)

23

Turnitin (3)

Advantages:

Reasonably accurate

Ease and speed of use

Printed reports

Comprehensive datastore

Most formats

Management tool

Disadvantages:

Algorithm can be fooled

English only

Quotes and references are poorly handled

“False sense of security”

24

Algorithm and Functionality

Computing and Mathematics

A PowerMac G4 ("Mirrored Drive Doors" model) with open case showing the logic board. Photo by Alistair McMillan, 2006.

Discipline specific:

Program code

Diagrams (UML, flowcharts, etc.)

Lab reports

Images (graphics, image processing)

26

Computing

Discipline specific:

Equations

Theorems and proofs

Statistical analyses

MATLAB programs

27

Mathematics

It won’t work!

– String matching algorithm inappropriate

– Database does not contain (much) code

Commercial products exist, for example

– Black Duck Software

– Similix Corporation

28

Why not use Turnitin?

/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}

/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}

29

/* Programs 1 and 2 */

Program 3

(Source code for MS Windows 7)

Program 4

(code 50% identical to the source code for MS Windows 7)

30

/* Programs 3 and 4 */

public class Sun {
    static final double latitude = 52.4;
    static final double longitude = -1.5;
    static final double tpi = 2.0 * pi;
    /* ... */

    public static void main(String[] args) {
        calculate();
    }

    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    }

    public static void calculate() { /* ... */ }
    /* ... */

31

/* Program 5 */

public class SunsetCalculator {
    static float latitude = 52.4;
    static float longitude = -1.5;
    /* ... */

    public static void main(String[] args) {
        findSunsetTime();
    }

    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    }

    public static void findSunsetTime() { /* ... */ }
    /* ... */

32

/* Program 6 */

Apart from source-code re-use, need to think about:

Use of (object-oriented) templates

Converting code to a different language

Code-generator software

Getting source-code written by someone else

What constitutes minimal / moderate / extreme plagiarism?

33

What is Source-Code Plagiarism?

“Open Source” code

Translation between languages

Re-use of code from previous assignments

Placing references within technical documentation (comments)

34

What do Students Misunderstand?

Common equations such as E = mc² don’t need referencing.

Probably most others do.

Are there any “grey areas”?

35

Mathematical Equations

Why do Students Plagiarise?

Why, Arizona, by Ken Lund, 2010.

Money

Career advancement

Company advancement

Tight deadlines

Poor ethics

What about academics?

37

Digression – Industry

Weak students

Lazy students

Students with poor time management skills

Overworked students

Peer pressure

Cultural factors

Lack of understanding

“Bad, sad or mad” (Culwin, 2006).

38

Students

How do Students Plagiarise?

Tiles on LaSalle Street, New Orleans, by Infrogmation, 2009.

Google

Friends

Lecturer’s notes

Seeing what other students are doing

Textbooks

Code repositories

Forums

Cheat sites

Where to Find Information

‘Rent-A-Coder’

Low rates ($10) – so quality of code?

Plagiarism by hired coders?

Private Internet sites make search engines ineffective.

Use of mobile devices and IM tools makes tracing difficult.

41

Contract Cheating

Break

Photo by Vanderdecken, 2007.

Detection Strategies

Sherlock Holmes and John H. Watson, by Sidney Paget (1860-1908)

Google search on phrases

Abnormal style

Unusual phrases or spellings (incl. in program comments)

Unusual algorithm used by a program

Unusual formatting

– Fonts, indentation (wordprocessor)

– Brace style (etc.) (program)

44

Tricks of the Trade

Detection Tools

Photo by Wolfen Silva, 2004.

Attribute counting systems (Halstead, 1972; Ottenstein, 1976):

Numbers of unique operators

Numbers of unique operands

Total numbers of operator occurrences

Total numbers of operand occurrences
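The four counts listed above can be sketched as follows. This is a minimal illustration, not the procedure of Halstead or Ottenstein: the class name, the regular expression and the rule that identifiers and numbers are operands while keywords and symbols are operators are all simplifying assumptions.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical attribute counter; the token classification is an assumption. */
final class AttributeCounter {
    // Identifiers, numbers, runs of operator symbols, single punctuation marks.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+(?:\\.\\d+)?|[-+*/%=<>!&|]+|[(){}\\[\\];,.]");
    private static final Set<String> KEYWORDS = Set.of(
        "int", "float", "double", "for", "while", "if", "else", "return");

    /** Returns {unique operators, unique operands, total operators, total operands}. */
    static int[] profile(String source) {
        Set<String> uniqueOperators = new HashSet<>();
        Set<String> uniqueOperands = new HashSet<>();
        int totalOperators = 0;
        int totalOperands = 0;
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char c = t.charAt(0);
            boolean wordLike = Character.isLetterOrDigit(c) || c == '_';
            if (wordLike && !KEYWORDS.contains(t)) {
                uniqueOperands.add(t);   // identifier or numeric literal
                totalOperands++;
            } else {
                uniqueOperators.add(t);  // keyword, operator symbol or punctuation
                totalOperators++;
            }
        }
        return new int[] {
            uniqueOperators.size(), uniqueOperands.size(), totalOperators, totalOperands};
    }
}
```

Renaming every variable leaves all four numbers unchanged, so identical profiles hint at copying. The profile ignores program structure entirely, which is one reason to prefer comparing token sequences.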

46

History (1)

Structure-based systems:

Each program is converted into token strings (or something similar)

Token streams are compared for determining similar source-code fragments

Tools: YAP3, JPlag, Plague, GPlag, XPlag, Plaggie, MOSS, Sherlock, Jones, Cogger, SID, SIM, …

47

History (2)

CodeSuite (www.safe-corp.biz)

– Exact algorithm not published

– Patents apply

MossPlus (www.similix.com)

– Commercial version of MOSS

– “multi-million dollar copyright and criminal theft cases”

– Patents apply

48

Commercial Products (examples)

int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}

Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}

49

Example (Tokenwise Equivalence)

type name(type name) start
    type name=number
    loop (type name=number name compare number operation name) start
        name operation name
    end
    return name
end
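A minimal sketch of how source text might be reduced to such a token string follows. The token vocabulary (type, name, number, loop, start, end, compare, operation) is borrowed from the slide; the class name, the lexer rules and the short list of type names are assumptions, and a real tool would use a proper parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical lexer that replaces concrete names, types and literals by abstract tokens. */
final class TokenAbstraction {
    // Words, numbers, two-character operators, then single symbols; ';' and ',' are skipped.
    private static final Pattern LEXEME = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+(?:\\.\\d+)?|<=|>=|==|!=|[-+*/%]=|\\+\\+|--|[-+*/%<>=(){}]");
    private static final Set<String> TYPES =
        Set.of("int", "long", "float", "double", "String", "Integer");
    private static final Set<String> COMPARISONS = Set.of("<", ">", "<=", ">=", "==", "!=");

    static List<String> tokenise(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String s = m.group();
            char c = s.charAt(0);
            if (TYPES.contains(s)) out.add("type");
            else if (s.equals("for") || s.equals("while")) out.add("loop");
            else if (s.equals("return")) out.add("return");
            else if (Character.isLetter(c) || c == '_') out.add("name");
            else if (Character.isDigit(c)) out.add("number");
            else if (s.equals("{")) out.add("start");
            else if (s.equals("}")) out.add("end");
            else if (s.equals("(") || s.equals(")") || s.equals("=")) out.add(s);
            else if (COMPARISONS.contains(s)) out.add("compare");
            else out.add("operation");
        }
        return out;
    }
}
```

Two fragments that differ only in identifiers, declared types and literal values produce the same list, which is the sense in which they are tokenwise equivalent. Differences such as omitted braces or a reversed loop still change the token string and need a more tolerant comparison.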

Have a look at the program you have been given.

Can you spot the plagiarised bits?

How much is plagiarised?

What techniques have been used?

50

Intermission …

Guido Malpohl, Karlsruhe, Germany

Code fragment similarity values based on similar tokens found

Java, C#, C, C++, Scheme, and natural language text

Web-based: www.ipd.uni-karlsruhe.de/jplag

Algorithm: Parse programs and tokenise then pairwise compare using “Greedy String Tiling” (Prechelt et al., 2002)

maximises percentage of common token strings

worst case Θ(n³), average case linear

Programs must compile?
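The tiling idea can be illustrated with a brute-force sketch. It repeatedly takes the longest common run of still-unmarked tokens, marks it as a tile, and stops when no run of at least the minimum length remains. JPlag itself speeds up the search with hashing, which is omitted here; the class name, method names and the similarity formula 2 × covered / (|A| + |B|) as written are assumptions of this sketch rather than a copy of JPlag's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical brute-force Greedy String Tiling over two token lists. */
final class GreedyTiling {
    /** Number of tokens in each list covered by tiles of length >= minMatch. */
    static int coveredTokens(List<String> a, List<String> b, int minMatch) {
        if (minMatch < 1) throw new IllegalArgumentException("minMatch must be >= 1");
        boolean[] markedA = new boolean[a.size()];
        boolean[] markedB = new boolean[b.size()];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>(); // each entry is {startInA, startInB}
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int k = 0; // length of the unmarked common run starting at (i, j)
                    while (i + k < a.size() && j + k < b.size()
                            && !markedA[i + k] && !markedB[j + k]
                            && a.get(i + k).equals(b.get(j + k))) {
                        k++;
                    }
                    if (k > maxMatch) { matches.clear(); maxMatch = k; }
                    if (k == maxMatch) matches.add(new int[] {i, j});
                }
            }
            for (int[] m : matches) {
                boolean occluded = false; // overlaps a tile placed earlier in this round
                for (int k = 0; k < maxMatch; k++) {
                    if (markedA[m[0] + k] || markedB[m[1] + k]) { occluded = true; break; }
                }
                if (occluded) continue;
                for (int k = 0; k < maxMatch; k++) {
                    markedA[m[0] + k] = true;
                    markedB[m[1] + k] = true;
                }
                covered += maxMatch;
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    /** Share of all tokens covered by tiles, between 0.0 and 1.0. */
    static double similarity(List<String> a, List<String> b, int minMatch) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        return 2.0 * coveredTokens(a, b, minMatch) / (a.size() + b.size());
    }
}
```

Because tiles may appear in a different order in the two programs, moving whole blocks of code around does not lower the score. The minimum match length trades sensitivity against chance matches.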

51

JPlag

52

JPlag Example

Alex Aiken, Berkeley/Stanford, USA, 1994

Multilingual: C, C++, Java, Pascal, Ada, ML, Lisp, and Scheme programs

Web-based: theory.stanford.edu/~aiken/moss/

“Winnowing” (Schleimer et al., 2003)

Local document fingerprinting algorithm

Efficiency proven (within 33% of the lower bound)

Guarantees detection of matches longer than a certain threshold
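A minimal sketch of winnowing, under stated assumptions: every k-character substring is hashed, a window of w consecutive hashes slides along the text, and the minimum hash of each window (the rightmost one on ties) is kept as a fingerprint. Any shared substring of at least w + k - 1 characters then contributes at least one common fingerprint. The class name, the use of `String.hashCode` and the absence of any normalisation of the text are choices made for this illustration, not details of MOSS.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical winnowing fingerprinter over raw characters. */
final class Winnowing {
    /** Selects the hash values that fingerprint the text, for k-grams and window size w. */
    static Set<Integer> fingerprints(String text, int k, int w) {
        if (k < 1 || w < 1) throw new IllegalArgumentException("k and w must be >= 1");
        Set<Integer> selected = new LinkedHashSet<>();
        int n = text.length() - k + 1; // number of k-grams
        if (n <= 0) return selected;
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        int window = Math.min(w, n); // short texts get a single window
        int lastPicked = -1;
        for (int start = 0; start + window <= n; start++) {
            int min = start;
            for (int i = start + 1; i < start + window; i++) {
                if (hashes[i] <= hashes[min]) min = i; // "<=" keeps the rightmost minimum
            }
            if (min != lastPicked) { // record each selected position only once
                selected.add(hashes[min]);
                lastPicked = min;
            }
        }
        return selected;
    }
}
```

Two documents are then compared by intersecting their fingerprint sets, so only a fraction of the hashes needs to be stored per document.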

53

MOSS

54

MOSS Example

University of Warwick, Open Source

Open Source – sherlock.org.uk

Multilingual (including natural language), but works best on Java

Preprocesses code (not a full parse!) then simple string comparison. Preprocessing includes:

– Remove comments

– Remove whitespace

– Normalise formatting/indentation

– Tokenise
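The first preprocessing steps can be sketched with regular expressions. This is an illustration only and not Sherlock's own code: the class and method names are hypothetical, and because there is no parse, comment markers that occur inside string literals are handled wrongly.

```java
/** Hypothetical preprocessing pass: strip comments, then collapse whitespace. */
final class Preprocess {
    static String normalise(String source) {
        // Remove /* ... */ comments; (?s) lets '.' match across line breaks.
        String noBlockComments = source.replaceAll("(?s)/\\*.*?\\*/", " ");
        // Remove // comments up to the end of the line.
        String noComments = noBlockComments.replaceAll("//[^\\n]*", " ");
        // Collapse every run of whitespace to one space and trim the ends.
        return noComments.replaceAll("\\s+", " ").trim();
    }
}
```

After this pass, two files that differ only in comments, indentation or line breaks compare as equal strings.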

55

Sherlock

56

Sherlock Example

57

Sherlock – Document Set

MOSS, JPlag and Sherlock are effective

Results returned are not identical, but similar

User interface issues are important

Reliable sets of test data are unavailable.

None of these tools pulls material from the Internet

58

Effectiveness

Latent Semantic Analysis (Cosma and Joy, 2010)

– Documents as “bags of words”

– Known technique in IR

– Handles synonymy and polysemy

– Maths is nasty
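For reference, the core of the mathematics in its standard textbook form is given below. The symbols are chosen for this note, and the term weighting and the choice of k used by Cosma and Joy are not specified here. Each document becomes a column of term counts, the matrix is factorised, and only the k largest singular values are kept.

```latex
% A is the m-by-n term-by-document matrix: a_{ij} = weight of term i in document j.
A = U \Sigma V^{T}
\qquad\Longrightarrow\qquad
A_k = U_k \Sigma_k V_k^{T}, \quad k \ll \min(m, n)

% Document j is represented by the j-th column d_j of \Sigma_k V_k^{T};
% two documents are compared by the cosine of the angle between their vectors.
\operatorname{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}
```

Truncating to rank k is what lets documents with different but related vocabulary end up close together.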

Clone Detection (Brixtel et al., 2010; Koschke, 2007)

– Provenance of code in large software systems

– Use of very large datasets (e.g. SourceForge)

– Not targeted at plagiarism

– Tools include Dup and VCCFinder

59

Other Approaches

Prevention Strategies

Sometimes students are asked to copy

– group assignments

We ask students to share ideas

– that’s what universities are for!

Real programmers re-use code

What is plagiarism?

– maybe not a simple question after all!

61

Plagiarism vs. Collaboration

Never re-use assignments.

Assess deeper levels of learning.

Use tasks allowing multiple solutions.

Integrate tasks.

Set tasks based on recent events / sources.

Configure assignments so each student is given a slightly different version.

Require assignments to be done in controlled conditions (labs).
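One way to give each student a slightly different version is to derive the task parameters from the student identifier, so that the variant can be regenerated at marking time. The sketch below is an assumption-laden illustration: the class name, the seed scheme and the idea of varying the latitude and longitude of a sunset calculation are hypothetical and are not taken from any system named in these slides.

```java
import java.util.Random;

/** Hypothetical generator of per-student assignment parameters. */
final class AssignmentVariants {
    /** Returns {latitude, longitude}, fixed for a given student and module seed. */
    static double[] coordinatesFor(String studentId, long moduleSeed) {
        // Seeding from the identifier makes the variant reproducible without storing it.
        Random rng = new Random(moduleSeed ^ studentId.hashCode());
        double latitude = 50.0 + rng.nextInt(80) / 10.0;   // 50.0 to 57.9
        double longitude = -5.0 + rng.nextInt(60) / 10.0;  // -5.0 to 0.9
        return new double[] {latitude, longitude};
    }
}
```

A submission whose output matches another student's coordinates is then easy to spot, although two students can occasionally be assigned the same values.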

62

Prevention and Cure (1)

Define institution policy clearly.

Define rôles of institution bodies (exam board, tribunal, etc.)

Make disciplinary process also about learning.

Train staff.

Fast track procedure for minor cases.

Record and monitor.

Adapted from Carroll and Appleton (2001).

63

Prevention and Cure (2)

Process

Old Bailey, 2006. Unattributed (Wikimedia).

First offence (unless very serious, e.g. PhD), meeting with appropriate senior member of staff in Department:

– tutor / friend / SU representative allowed to accompany student

– nominal penalty available (e.g. mark of 0 for assignment)

– “formative” experience for the student

Second offence (or serious first offence)

– University tribunal

– tutor / friend / SU representative allowed to accompany student

– full range of penalties (including expulsion)

65

Typical Institution Policy

Quality Assurance Code of Practice QA53.

Three levels of offence – Group 1 (minor), Group 2 (moderate), Group 3 (severe).

Possible penalties available for an offence specified by Group (see table in appendix to QA53).

Groups 1 and 2 offences dealt with by Department.

Group 3 offences initiate Board of Inquiry.

Appeals are allowed under certain conditions only.

66

University of Bath

Discussion

The round table, Great Hall, Winchester Castle, by Graham Horn, 2009.

Evaluation
