Mike Joy 25 February 2010 New Approaches for Detecting Similarities in Program Code

Mike Joy

25 February 2010

New Approaches for Detecting Similarities in Program Code

Overview of Talk

1) What is the Problem?
2) Historical Overview
3) New Approaches
4) Where Next?

Part 1 – What is the Problem?

Document similarity
– What do we mean?
– Why is software an issue?
– Why is this interesting?

Four stages

Collection

Detection

Confirmation

Investigation

From Culwin and Lancaster (2002).

Stage 1: Collection

• Get all documents together online
– so they can be processed
– formats?
– security?

• BOSS (Warwick)

• Coursemaster (Nottingham)

• Managed Learning Environment

Stage 2: Detection

• Compare with other submissions

• Compare with external documents
– essay-based assignments

• We’ll come back to this later
– it’s the interesting bit!

Stage 3: Confirmation

• Software tool says “A and B similar”

• Are they?

• Never rely on a computer program!

• Requires expert human judgement

• Evidence must be compelling

• Might go to court

Stage 4: Investigation

• A from B, or B from A, or joint work?

• If A from B, did B know?
– open networked file
– printer output

• Did the culprit/s understand?

• University processes must be followed

Why is this Interesting?

How do you compare two programs?
– This is an algorithm question
– Stages 2 and 3: detection and confirmation

How do you use the results (of a comparison) to educate students?
– This is a pedagogic question
– Stage 4, and before stage 1!

Digression: Essays

Plagiarism in essays is easier to detect

Lots of “tricks” a lecturer can use!
– Google search on phrases
– Abnormal style
– ... etc.

Software tools
– Let's have a look ...

Pedagogy

Can be used by academics to
– detect plagiarism
– provide evidence

Can be used by students to
– check their own work

Part 2 – Historical Overview

How has similar code been detected in the past?

How well do the approaches work?

Why not use Turnitin?

• It won’t work!
• String matching algorithm inappropriate
• Database does not contain code

• Commercial involvement
– E.g. Black Duck Software

/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}

/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}

Is This Plagiarism?

• Is Program 2 derived from Program 1 in a manner which is “plagiarism”?

• Probably No
– It's too simple
– Too many copies in books / on the web
– Most of it is generic syntax

Program 3

(Source code for MS Windows 7)

Program 4

(code 98% identical to the source code for MS Windows 7)

Is This Plagiarism?

• Is Program 4 derived from Program 3 in a manner which is “plagiarism”?

• Definitely Yes
– It's too complicated to happen by chance
  • Millions of lines of code
– The source is “closed”
  • Microsoft guard it very well!

/* Program 5 */
public class Sun {
    static final double latitude = 52.4;
    static final double longitude = -1.5;
    static final double tpi = 2.0 * Math.PI;
    /* ... */
    public static void main(String[] args) { calculate(); }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    }
    public static void calculate() { /* ... */ }
    /* ... */
}

/* Program 6 */
public class SunsetCalculator {
    static float latitude = 52.4f;
    static float longitude = -1.5f;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159f * (x - (int)(x));
        if (y < 0) y = 2*3.14159f + y;
        return y;
    }
    public static void findSunsetTime() { /* ... */ }
    /* ... */
}

Is This Plagiarism?

• Is Program 6 derived from Program 5 in a manner which is “plagiarism”?

• Maybe
– Structure is similar – cosmetic changes
– But the algorithm is public domain
– Maybe 6 derived from 5, maybe the other way round

History ...

• First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976)

• More recent systems compare the structure of source-code programs

• Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.

Detection Tools (1)

Attribute counting systems (Halstead, 1972):

• Numbers of unique operators
• Numbers of unique operands
• Total numbers of operator occurrences
• Total numbers of operand occurrences
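
The four counts above can be sketched in a few lines of Python. This is only an illustration: the regular expressions, the `KEYWORDS` set and the function name `attribute_counts` are assumptions made for the example, and treating keywords as operators is one common convention rather than something stated on the slide.

```python
import re
from collections import Counter

# Assumed, simplified lexical rules for a C/Java-like language.
OPERATOR_RE = re.compile(r"\+\+|--|[-+*/%<>=!]=|[-+*/%<>=!(){};]")
OPERAND_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?")
KEYWORDS = {"for", "if", "else", "while", "return", "int", "float", "double"}


def attribute_counts(source):
    """Return (unique operators, unique operands, total operators, total operands)."""
    operators = Counter(OPERATOR_RE.findall(source))
    words = OPERAND_RE.findall(source)
    # Keywords are counted as operators; everything else is an operand.
    operators.update(w for w in words if w in KEYWORDS)
    operands = Counter(w for w in words if w not in KEYWORDS)
    return (len(operators), len(operands),
            sum(operators.values()), sum(operands.values()))


original = attribute_counts("ans = ans * j + 1;")
renamed = attribute_counts("r = r * k + 1;")
print(original, renamed)
```

Renaming identifiers leaves the profile unchanged, which is why attribute counting can flag disguised copies, but unrelated programs can also share a profile, which is one reason later systems moved to comparing structure.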

Detection Tools (2)

Structure-based systems:

– Each program is converted into token strings (or something similar)

– Token streams are compared to find similar source-code fragments

– Tools: JPlag, MOSS, and Sherlock

Example (code 1)

int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}

Example (code 2)

Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}

Example (tokenised)

type name(type name) start
type name=number
loop (type name=number name compare number operation name) start
name operation name
end
return name
end
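
A minimal tokeniser along these lines might look as follows. It is a sketch, not the tokeniser of any of the tools discussed: the `TYPES` set, the regular expression and the function name `tokenise` are assumptions, and braces have been added around the loop body of the second example so that both inputs have the same block structure.

```python
import re

# Assumed vocabulary; a real tool would use the language's full grammar.
TYPES = {"int", "float", "double", "long", "String", "Integer"}
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<compare><=|>=|==|!=|<|>)
  | (?P<operation>\+\+|--|[-+*/%]=|[-+*/%])
  | (?P<assign>=)
  | (?P<punct>[(){};])
""", re.VERBOSE)


def tokenise(source):
    """Map source text to a list of generic token names."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            if text in TYPES:
                tokens.append("type")
            elif text in ("for", "while"):
                tokens.append("loop")
            elif text == "return":
                tokens.append("return")
            else:
                tokens.append("name")
        elif kind == "punct":
            if text == "{":
                tokens.append("start")
            elif text == "}":
                tokens.append("end")
            elif text in "()":
                tokens.append(text)
            # Semicolons are dropped.
        elif kind == "assign":
            tokens.append("=")
        else:
            tokens.append(kind)  # number, compare or operation
    return tokens


code1 = ("int calculate(String arg) { int ans=0; "
         "for (int j=1; j<=100; j++) { ans *= j; } return ans; }")
code2 = ("Integer doit(String v) { float result=0.0; "
         "for (float f=100.0; f > 0.0; f--) { result *= f; } return result; }")
print(tokenise(code1) == tokenise(code2))
```

Because identifiers, types and literal values all collapse to generic tokens, the renamed and retyped version produces exactly the same token stream as the original.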

Detectors

MOSS (Berkeley/Stanford, USA)

JPlag (Karlsruhe, Germany)
– Java only
– Programs must compile?

Sherlock (Warwick, UK)

MOSS and JPlag are Internet resources
– Data Protection?

MOSS

Developed by Alex Aiken in 1994

MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs.

MOSS is free, but you must create an account

MOSS home page: http://theory.stanford.edu/~aiken/moss/

MOSS – Algorithm

“Winnowing” (Schleimer et al., 2003)
– Local document fingerprinting algorithm
– Efficiency proven (within 33% of the lower bound)
– Guarantees detection of matches longer than a certain threshold
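
The core idea of winnowing can be sketched as below: hash every k-gram, slide a window of w consecutive hashes over the sequence, and keep the minimum hash of each window (the rightmost one on ties). The hash function (`zlib.crc32`) and the parameter values `k=5`, `w=4` are assumptions chosen for illustration, not the settings used by MOSS.

```python
import zlib


def winnow(text, k=5, w=4):
    """Return a set of (hash, position) fingerprints for `text`."""
    hashes = [zlib.crc32(text[i:i + k].encode())
              for i in range(len(text) - k + 1)]
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Choose the rightmost occurrence of the minimum within the window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


shared = "for(i=0;i<n;i++)"
doc_a = "AAAA" + shared + "BBBB"
doc_b = "CCCCCC" + shared + "DD"
common = {h for h, _ in winnow(doc_a)} & {h for h, _ in winnow(doc_b)}
print(len(common) > 0)
```

Any substring of at least w + k − 1 characters that two documents share contains a complete window of identical hashes, so both documents select the same minimum from it; this is the detection guarantee mentioned above.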

Using MOSS

• MOSS is provided as an Internet service
• User must download the MOSS Perl script for submitting files to the MOSS server
• The script uses a direct network connection
• The MOSS server produces HTML pages listing pairs of programs with similar code
• MOSS highlights similar code fragments within programs that appear the same
• Data Protection? – US service
• Maintenance?

JPlag

• Developed by Guido Malpohl in 1996
• JPlag currently supports Java, C#, C, C++, Scheme, and natural language text
• Use of JPlag is free, but user must create an account
• JPlag can be used to compare student assignments but does not compare with code on the Internet
• JPlag home page: www.ipd.uni-karlsruhe.de/jplag

JPlag – Algorithm

1) Parse (or scan) programs

2) Convert programs to tokens

3) Pairwise compare
• “Greedy String Tiling”
- maximises percentage of common token strings
- worst case Θ(n³), average case linear

Prechelt et al. (2002)
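
A simplified version of Greedy String Tiling is sketched below. It omits the hashing speed-up that practical implementations rely on, so it shows the cubic worst case rather than the near-linear average. The names `greedy_string_tiling`, `similarity` and `min_match` are assumptions for this example, and the score 2 × covered tokens / (|A| + |B|) is, to the best of our reading, the form of measure described by Prechelt et al. (2002).

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping tiles (start_a, start_b, length), longest first."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find the longest common runs of unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        # Mark each maximal match as a tile unless it overlaps one just placed.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for d in range(k):
                marked_a[i + d] = marked_b[j + d] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a, b, min_match=3):
    """Fraction of tokens in both sequences that are covered by tiles."""
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b)) if (a or b) else 0.0


tokens_a = list("abcdefgh")
tokens_b = list("xxabcdyyefgh")
print(greedy_string_tiling(tokens_a, tokens_b), similarity(tokens_a, tokens_b))
```

In the example the two runs `abcd` and `efgh` are found as separate tiles even though extra tokens were inserted between them, which is how the approach tolerates reordered or padded code.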

JPlag File Processing

JPlag - Results

• Results in HTML Format

• Histogram of similarity values found for all pairs of programs

• Similar pairs and their similarity values displayed

• Select file pairs to view

JPlag - Matches

• Similar lines matched with the same colour

• Code fragment similarity values based on similar tokens found

Sherlock

• Developed at the University of Warwick Department of Computer Science

• Sherlock was fully integrated with the BOSS online submission software in 2002 and released as open source

• Sherlock detects plagiarism in source-code and natural language assignments

• BOSS home page: www.boss.org.uk

Sherlock - Preprocessing

Whitespace
Comments
Normalisation
Tokenisation
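
As a rough illustration of the first three preprocessing steps, the sketch below strips C-style comments, lower-cases the text and collapses whitespace. It is not Sherlock's actual code: the function name `preprocess` and the choice of lower-casing as the normalisation step are assumptions, and the comment patterns are naive (they would also fire inside string literals).

```python
import re


def preprocess(source):
    """Remove comments, normalise case and collapse whitespace."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    source = source.lower()                                      # assumed normalisation
    return " ".join(source.split())                              # whitespace


print(preprocess("int  X = 1; // set x\n/* note */ X++;"))
```

Running several such variants of each file (with and without comments, with and without whitespace) lets a detector catch copies that differ only in layout or commentary.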

Sherlock – Results

• Results displayed

• Similarity values of suspicious files

• Similarity values depend on the length of similar lines found as a percentage of the whole file size

• Select suspicious matches to examine

• Mark suspicious files

Sherlock – Matches

Suspected sections marked with

**begin suspicious section**

and

**end suspicious section**

Sherlock – Document Set

• User can view graph

• Each node represents one submission

• An edge means two submissions are similar

• Options to select threshold

• Click on lines to view or to mark suspicious matches
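
The graph view can be thought of as a thresholded similarity graph. The sketch below assumes that an edge links two submissions whose similarity score reaches the selected threshold; the function name `similarity_graph`, the file names and the scores are all hypothetical.

```python
def similarity_graph(scores, threshold):
    """scores maps (file_a, file_b) to a similarity percentage."""
    nodes = sorted({name for pair in scores for name in pair})
    edges = sorted(pair for pair, value in scores.items() if value >= threshold)
    return nodes, edges


# Hypothetical pairwise scores for three submissions.
scores = {
    ("a.java", "b.java"): 80,
    ("a.java", "c.java"): 10,
    ("b.java", "c.java"): 45,
}
nodes, edges = similarity_graph(scores, threshold=40)
print(nodes, edges)
```

Raising the threshold removes weaker edges, so clusters of closely related submissions stand out from incidental similarity.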

CodeMatch

Commercial product
Free academic use for small data sets
Exact algorithm not published
• patent pending?

Example of Identical “Instruction Sequences”

/* File 1 */
for (int i=1; i<10; i++) { if (a==10) print("done"); else a++; }

/* File 2 */
for (int x=100; x > 0; x--) { if (z99 > -10) print("ans is " + z99); else { abc += 65; } }

CodeMatch – Algorithm

1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions
2) Extract comments, and compare
3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar”
4) Combine (1), (2) and (3) to give correlation score

Heuristics

Comments
– Spelling mistakes
– Unusual English (Thai, German, …)

Use of Search Engines
Unusual style
Code errors

Tool Efficiency

• MOSS, JPlag and Sherlock are effective
• Results returned are similar
• Results returned are not identical
• User interface issues may be important

Part 3 – New Approaches

Eschew the “syntax driven” approach

Lateral thinking?

Case study: Latent Semantic Analysis

Digression: Similarity

What do we actually mean by “similar”?

This is where the problems start ...

(1) Staff Survey

We carried out a survey in order to:
– gather the perceptions of academics on what constitutes source-code plagiarism, and
– create a structured description of what constitutes source-code plagiarism from a UK academic perspective

(Cosma and Joy, 2008)

Data Source

• On-line questionnaire distributed to 120 academics
– Questions were in the form of small scenarios
– Mostly multiple-choice responses
– Comments box below each question
– Anonymous – option for providing details

• Received 59 responses, from more than 34 different institutions

• Responses were analysed and collated to create a universally acceptable source-code plagiarism description.

Results

Grey areas include:

– O-O templates
– Inappropriate collaboration
– Translating between (programming) languages
– Re-use of work already submitted

Other Issues

Various issues on source-code plagiarism including:
– Source-code reuse
– Source-code self-plagiarism
– Copying without adaptation
– Copying with adaptation: minimal, moderate, extreme
– Converting source to another language
– Using code-generator software
– Collusion
– Obtaining source-code written by other authors
– False and “pretend” references

(2) Student Survey

We carried out a survey (Joy et al., 2008) in order to:
– gather the perceptions of students on what (source code) plagiarism means,
– identify types of plagiarism which are poorly understood, and
– identify categories of student who perceive the issue differently to others

Data Source

• Online questionnaire answered by 770 students from computing departments across the UK

• Anonymised, but brief demographic information included

• Used 15 “scenarios”, each of which may describe a plagiaristic activity

Results (1)

No significant difference in perspectives in terms of

– university
– degree programme
– level of study (BS, MS, PhD)

Results (2)

Issues which students misunderstood:

– Open Source code
– Translating between languages
– Re-use of code from previous assignments
– Placing references within technical documentation

Latent Semantic Analysis

Documents as “bags of words”

• Known technique in IR

• Handles synonymy and polysemy

• Maths is nasty

Results reported in (Cosma and Joy, 2010)

Document Corpus

• m x n “term by document” matrix A
• Rows = unique words
• Columns = documents
• Entries = no. of occurrences
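
Building such a matrix from a small corpus can be sketched as follows. Splitting on whitespace is an assumption made to keep the example short; for source code the files would normally be preprocessed and tokenised first. The function name `term_document_matrix` is hypothetical.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


terms, A = term_document_matrix(["int x int y", "float x"])
for term, row in zip(terms, A):
    print(term, row)
```

Here m is the number of distinct terms (4) and n the number of documents (2); most entries of a realistic matrix are zero.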

Term Weighting

Algorithm to weight data in A
• Local and global weights
• Importance of terms in matrix A
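
The slides do not say which weighting scheme was used, so the sketch below picks one common combination purely as an illustration: a logarithmic local weight, log(1 + count), multiplied by an inverse-document-frequency global weight, log(n / df). The function name `weight_matrix` is hypothetical.

```python
import math


def weight_matrix(matrix):
    """Apply local (log) and global (idf) weights to a term-by-document matrix."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        doc_freq = sum(1 for count in row if count > 0)
        # Global weight: terms found in every document carry no information.
        global_weight = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append([math.log(1 + count) * global_weight for count in row])
    return weighted


W = weight_matrix([[2, 0], [1, 1]])
print(W)
```

The second term occurs in both documents, so its global weight is zero and it no longer contributes to any comparison; the first term, found in only one document, keeps a positive weight there.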

Singular Value Decomposition (SVD)

Decompose m x n matrix $A = U \Sigma V^{T}$

U is an m x r “term by dimension” matrix
V is an n x r “file by dimension” matrix
Σ is an r x r “singular values” matrix

Truncate matrices to k dimensions, where k ≤ r

SVD (2)

$A_k = U_k \Sigma_k V_k^{T}$

Reduces “noise”
Highlights important relations between terms and documents

Size of k determined experimentally
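
Written out term by term, the truncation keeps the k largest singular values and drops the rest. In the expansion below, $u_i$ and $v_i$ denote the i-th columns of U and V, and $\sigma_i$ the i-th diagonal entry of Σ; these symbols are introduced here for illustration and do not appear on the slides.

```latex
A = U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad
\lVert A - A_k \rVert_F^{2} = \sum_{i=k+1}^{r} \sigma_i^{2}
```

The last identity shows what is discarded: the error of the rank-k approximation is exactly the energy in the smallest singular values, which is the sense in which truncation removes “noise”. Choosing k too small also removes genuine structure, hence the need to tune it experimentally.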

SVD (3)

Given a “query” q (set of weighted keywords), can map to k-space:

$Q_k = q^{T} U_k \Sigma_k^{-1}$

Think of Q as a k-vector; can compare to vectors representing files using e.g. “cosine similarity” (dot product)
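
The projection and comparison can be sketched in plain Python. The matrices below are made-up values chosen so the arithmetic is easy to follow, not the output of a real decomposition; `sigma_k` holds only the diagonal of Σ_k, and the function names are hypothetical. Cosine similarity is the dot product divided by the product of the vector lengths.

```python
import math


def project_query(q, U_k, sigma_k):
    """Compute Q_k = q^T U_k Sigma_k^{-1} for a diagonal Sigma_k."""
    return [sum(q[i] * U_k[i][j] for i in range(len(q))) / sigma_k[j]
            for j in range(len(sigma_k))]


def cosine_similarity(a, b):
    """Dot product of a and b normalised by their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 3-term, k = 2 example.
U_k = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]
sigma_k = [2.0, 1.0]
Q = project_query([2.0, 1.0, 5.0], U_k, sigma_k)
print(Q, cosine_similarity(Q, [2.0, 2.0]))
```

Files whose vectors in k-space point in nearly the same direction as Q score close to 1, regardless of file length, which is why the cosine is usually preferred to the raw dot product.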

Uses of LSA

Essay grading
Essay feedback
Indexing
Language independent processing
Cross-language information retrieval
Source-code clustering
Plagiarism detection (natural language)

Summary

LSA can help detect plagiarism instances missed by other tools
• Improved recall but poorer precision
• Integration with structure-based tools is effective

Visualisation of relative file similarities
Predictability of LSA results is problematic

Where Next?

• Algorithms to include Internet-located code
• “Blended” algorithms
• Cross-language detection
• Further exploration of LSA

References (1)

F. Culwin and T. Lancaster, “Plagiarism, prevention, deterrence and detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)

G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis” IEEE Transactions on Computers, to appear (2010)

G. Cosma and M.S. Joy, “Towards a Definition on Source-Code Plagiarism”, IEEE Transactions on Education 51(2) pp. 195-200 (2008)

References (2)

G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)

M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2) pp. 19-26 (1972)

M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair (2008), “Source Code Plagiarism – a Student Perspective” (under review)

M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)

References (3)

K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4) pp. 30-41 (1976)

L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11) pp. 1016-1038 (2002)

S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)