
Detecting Computer Code Plagiarism in Higher Education

Mario Konecki, Tihomir Orehovacki, Alen Lovrencic
Faculty of Organization and Informatics
Pavlinska 2, 42000 Varaždin, Croatia
[email protected], [email protected], [email protected]

Abstract. The rapid development of industry and the economy requires the quick and efficient teaching of a large amount of theoretical knowledge and practical skills. In computer science, and especially in programming, this trend is very noticeable: real experts are needed and are hard to create. At the same time, a problem has emerged in higher education, and its name is plagiarism. To prevent it, one would have to check all students' program codes for similarities. To do this efficiently, a procedure has been designed, upon which a prototype was developed and tested. We discuss the efficiency of this solution, mention some other methods and algorithms, consider other possible uses of this solution, and outline the further steps in our research.

Keywords. plagiarism detection, program code, comparison, higher education

1. Introduction

Computer experts are in high demand, especially today, when almost all aspects of our lives are in one way or another connected to computer technology. All computer systems rely on computer programs that give them a particular task and purpose. Because of this, the education of young programming experts is of great importance.

When looking at the higher education process, one problem appears repeatedly: plagiarism, especially of programming code. Students often copy and share their solutions rather than producing them by themselves. In this way, the percentage of students who could become capable experts in the area of programming decreases considerably.

To prevent this, there is a need to detect and sanction any form of plagiarism in this area. The method upon which such an algorithm is constructed could also be beneficial in other areas. This kind of algorithm would be very useful for detecting parts of code that can be reused, so that the code can be optimized and minimized.

There are many existing concepts that try to make programming easier and more efficient, such as design patterns [4], components [2], metacomponents [14], etc. All of these concepts contribute a little to the whole programming process, and the concepts we discuss in this article are yet another step toward writing better and more efficient programming code that is easy to maintain.

In the following section we discuss the problem of plagiarism and some existing methods and concepts of comparison. We also mention some existing algorithms and then present our own solution to the problem.

2. Plagiarism detection

The term comparison is closely connected to the term plagiarism, which is mostly discussed in the context of higher education [11]. Plagiarism of computer code is just one form of a broader problem (copying of papers, seminar works, etc.). In higher education, two categories of plagiarism can be distinguished [11]:
- plagiarism - presenting some web content as one's own
- collusion - individual assignments that are done jointly by two or more students

Both of these categories are common in plagiarism of computer code. There are a number of web sources from which students can copy a part of, or a whole, program code, and it is also frequent practice for students to collaborate on their individual assignments.

When discussing text comparison, it is important to mention the so-called Brown corpus, which shows that in a collection of 1 million words, 40% occur only once [8]. A word is, however, not the smallest element by which a proper distinction can be made. It has been established that the smallest element that can be used to recognize and fingerprint a text is the trigram (three consecutive words) [11]. This is supported by results showing that an average article contains 77% unique trigrams [11]. One of the better-known trigram-based comparators of computer code is the Ferret detector/comparator, which applies some preprocessing to prepare the code for analysis. Other similar algorithms are also available.
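To make the trigram idea concrete, the following sketch builds the set of word trigrams of a text and measures how many of one text's trigrams also appear in another. This is our own illustration, not part of the Ferret tool; the function names TrigramSet and TrigramOverlap are hypothetical. It is written in Visual Basic, the language of the prototype described later, and assumes the Windows Scripting.Dictionary object is available:

    ' Build the set of word trigrams (three consecutive words) of a text.
    Function TrigramSet(text As String) As Object
        Dim words() As String, key As String, i As Long
        Dim d As Object
        Set d = CreateObject("Scripting.Dictionary")
        words = Split(text, " ")
        For i = LBound(words) To UBound(words) - 2
            key = words(i) & " " & words(i + 1) & " " & words(i + 2)
            If Not d.Exists(key) Then d.Add key, True
        Next i
        Set TrigramSet = d
    End Function

    ' Fraction of the trigrams of text a that also occur in text b.
    Function TrigramOverlap(a As String, b As String) As Double
        Dim da As Object, db As Object, k As Variant, common As Long
        Set da = TrigramSet(a)
        Set db = TrigramSet(b)
        If da.Count = 0 Then Exit Function
        For Each k In da.Keys
            If db.Exists(k) Then common = common + 1
        Next k
        TrigramOverlap = common / da.Count
    End Function

A high overlap between two submissions would then flag the pair for closer manual inspection.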

A computer program, however, differs from standard text. It is more formal, and when looking for plagiarism in program code one must pay much more attention to code that is syntactically different but semantically the same.

Namely, in contrast to text, plagiarism of program code can be detected at the level of the program's idea, the algorithm, and can be camouflaged by syntactic changes in the code, so that the code looks different but has the same functionality. In other words, to detect plagiarism in program code efficiently, one has to detect invariants of the same code that have the same functionality.

Along with pure code comparison, another possible use of this kind of algorithm is computer code optimization and minimization. Several families of techniques specialized for this purpose are known [1]:
- Text-based techniques perform little or no transformation of the "raw" source code before attempting to detect identical or similar (sequences of) lines of code. Typically, white space and comments are ignored.
- Token-based techniques apply a lexical analysis (tokenization) to the source code and subsequently use the tokens as a basis for clone detection.
- AST-based techniques use parsers to first obtain a syntactical representation of the source code, typically an abstract syntax tree (AST). The clone detection algorithms then search for similar subtrees in this AST.
- PDG-based approaches go one step further in obtaining a source code representation of high abstraction. Program dependence graphs (PDGs) contain information of a semantical nature, such as the control and data flow of the program.
- Metrics-based techniques are related to hashing algorithms. For each fragment of a program, the values of a number of metrics are calculated, which are subsequently used to find similar fragments.
- Information retrieval-based methods aim at discovering similar high-level concepts by exploiting semantic similarities present in the source code itself (including the comments).

By finding repeating and similar parts of the code, these algorithms enable the programmer to write a piece of code in just one place and call variants of it from many places, which minimizes and optimizes the code and makes its future maintenance easier.

3. Building the comparison algorithm

In order to select similar codes, a concept that supports this has been formed. It consists of several steps/filters that normalize the source codes, neutralizing syntactic differences that have no influence on the program functionality, and then finally make the desired comparison. The steps are as follows (a sketch of one of the filters is given after the list):
1. divide the program code into parts, where one part is one function
2. remove all declarations of variables or functions
3. replace all variable names with a constant name X
4. replace all function names with a constant name Y
5. remove all input commands (lines)
6. remove all output commands (lines)
7. remove all blank lines
8. remove all blank spaces
9. if there are lines containing only "(" or ")", read those lines and the lines between them and form one line of the format (content)
10. if a "{" or "}" appears at the beginning or end of a line, move the bracket to a new line before or after the content between them
11. compare all lines of all computer codes, by the parts that are the result of the first step
12. also compare the sizes of these parts, in order to predict the content of a part by its size
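As an illustration of steps 3 and 4, the following simplified sketch (ours; the name UnifyNames is hypothetical) replaces a known list of identifiers with a constant name. A real filter would have to tokenize the code so that substrings of longer identifiers are not replaced:

    ' Simplified sketch of steps 3 and 4: unify identifier names.
    ' names() holds the identifiers found in the declarations removed
    ' in step 2; a real filter must match whole tokens only.
    Function UnifyNames(code As String, names() As String, unified As String) As String
        Dim i As Long
        UnifyNames = code
        For i = LBound(names) To UBound(names)
            UnifyNames = Replace(UnifyNames, names(i), unified)
        Next i
    End Function

Variable names would be passed with unified = "X" and function names with unified = "Y", matching steps 3 and 4 above.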

In the 11th step, the comparison is made. It can be done by comparing line by line, as pure text, and simply denoting whether the lines are the same or not, since they have been well prepared by the preceding filters; however, it is also possible to use other methods from information theory for calculating the difference.

The difference between such lines is, in information theory and computer science, called the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other [13]. This measure can be used to quantify the difference between code lines and to determine the degree of similarity. Some of the better-known algorithms for calculating edit distance are:
- Hamming distance [5]
- Levenshtein distance [9][12][7]
- Damerau-Levenshtein distance [3][10]
- Jaro-Winkler distance [6]
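As an illustration, a standard dynamic-programming implementation of the Levenshtein distance [9][12] might look as follows. This is our own sketch in Visual Basic, not part of the prototype:

    ' Levenshtein distance: minimum number of insertions, deletions
    ' and substitutions needed to turn string s into string t.
    Function Levenshtein(s As String, t As String) As Long
        Dim i As Long, j As Long, cost As Long
        Dim m As Long, n As Long
        Dim d() As Long
        m = Len(s)
        n = Len(t)
        ReDim d(0 To m, 0 To n)
        For i = 0 To m
            d(i, 0) = i
        Next i
        For j = 0 To n
            d(0, j) = j
        Next j
        For i = 1 To m
            For j = 1 To n
                If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0 Else cost = 1
                d(i, j) = d(i - 1, j) + 1                                      ' deletion
                If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1    ' insertion
                If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost ' substitution
            Next j
        Next i
        Levenshtein = d(m, n)
    End Function

The algorithm runs in O(mn) time and space for strings of lengths m and n.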

All of these algorithms have one major problem when it comes to code comparison. They work well with texts: if we take one sentence and rewrite it in another way, the result is syntactically and semantically different and as such has new value. But in program code, lines that are syntactically different are frequently semantically completely the same and as such have no new value, because they are just invariants of the same solution. This is because one problem can have many different solutions, and many of those solutions are very similar. What has been noticed is that a common practice among students is to make minor changes on the surface of the code (a different text sentence, different variable names, etc.) so that the code looks different, while in fact it is just an invariant of the same solution.

What our own solution offers, and what is lacking completely or partially in the other mentioned algorithms, is a fair amount of preprocessing, which is not necessary when comparing pure text but is obligatory when dealing with computer code. The preprocessing is designed to prepare the code for analysis. It aims to eliminate all the usual tricks used to mask a code so that it appears to be a completely different solution. This process greatly helps to remedy the problem of syntactically different solutions that are in fact just invariants of the same solution. Code prepared in this way is quite suitable for comparison, even with the mentioned edit distance algorithms. Also, the idea of comparing checksums that differ only a little for small content differences, in order to predict the content of a part by its size, is not present in the mentioned algorithms and would be greatly beneficial for speeding up the whole comparison process. The implementation of this part of the solution will be part of our further work.

4. Prototype of code comparator

In order to test the developed concept, a prototype was constructed. It was built in Visual Basic, since Visual Basic is a rather simple and clear programming language; the prototype itself compares C/C++ code. The speed and complexity of the code were also taken into consideration and are discussed later. The prototype includes most of the steps of the developed concept.

The code of the prototype consists of several filters that prepare the code for comparison, and of the comparison itself. In this initial prototype, the comparison is made by comparing two code lines rather than by calculating edit distance. Further development of the prototype to fully support the developed algorithm, including a more complex comparison process, will be part of our future efforts.

A code example of the function that removes blank spaces is given below:

    Dim temp As String
    Dim d As Long
    sText2 = ""
    For u = 0 To 10000
        d = u
        ' Read the next line from the code window being processed
        If caseL = 1 Then
            temp = GetLine(txtCode.hwnd, d)
        ElseIf caseL = 2 Then
            temp = GetLine(txtCode2.hwnd, d)
        End If
        ' An empty line marks the end of the code: store the result
        If Len(temp) = 0 Then
            If caseL = 1 Then
                txtCode.Text = sText2
            ElseIf caseL = 2 Then
                txtCode2.Text = sText2
            End If
            Exit Sub
        End If
        ' Append the line with all blank spaces removed
        sText2 = sText2 & Replace(temp, " ", "") & vbCrLf
    Next u
    If caseL = 1 Then
        txtCode.Text = sText2
    ElseIf caseL = 2 Then
        txtCode2.Text = sText2
    End If

All filters/functions included in the prototype implement the following filtering rules:
- every computer instruction has to be on a separate line
- all brackets must be on separate lines
- all input/output instructions have to be omitted
- names of functions/variables must be unified
- arguments of all functions have to be omitted
- all brackets that stand alone, without a body, have to be omitted
- declarations have to be omitted, but initializations must not be
- all blank spaces and lines have to be omitted
- all comments have to be omitted
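As an illustration of the bracket rules, the following simplified sketch (ours; the name BracesToOwnLines is hypothetical) moves every "{" and "}" onto a line of its own; any blank lines it creates are removed by the blank-line filter:

    ' Put every "{" and "}" on a separate line.
    ' Empty lines this creates are removed by the blank-line filter.
    Function BracesToOwnLines(code As String) As String
        Dim s As String
        s = Replace(code, "{", vbCrLf & "{" & vbCrLf)
        s = Replace(s, "}", vbCrLf & "}" & vbCrLf)
        BracesToOwnLines = s
    End Function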

After the filtering and preparation of the code, the goal is to compare the very structure of the code and in this way find plagiarism attempts. All typical changes that students make when trying to pass off someone else's work as their own are noted and cast out by the previously mentioned filters. The filtering is essential for successful comparison; without proper filtering, the comparison would not be of much use.

A test of this prototype was conducted on over a hundred examples and showed that even at this prototype stage the constructed algorithm is very suitable for plagiarism detection. In all tested cases the prototype found enough similarities to detect every attempt at plagiarism.

There are details that could be improved further, and that will be part of our future work. We intend to extend the prototype to several programming languages, implement better support for all data types, and compare implementations of this kind of software on several platforms in order to optimize its efficiency and minimize time consumption.

In our present solution, the comparison (the 11th step of the algorithm) is made by comparing computer codes line by line and stating whether they are the same or different. In our future work we plan to implement several edit distance algorithms that would additionally compare these lines and give a more detailed measure of difference.

The sample of computer codes was gathered from first-year computer science students. Solutions to several different assignments were obtained from all students, and 40% of the gathered solutions were randomly selected for use in this test. A screenshot showing the comparison of two codes is given in Figure 1.

Figure 1. Comparator comparing two codes of the same assignment

As observed, in 92% of the cases the changes that students made to create the illusion that the code was theirs were to:
- change the names of variables and functions
- change sentences that are written to the user
- put blank spaces or lines into the code
- change comments
- change some function and variable types
In the other 8% of the cases, students changed the order of variables and functions or changed the number of arguments of functions.

All this and more is covered by the filters, so they are functional for this purpose. In future development we will enhance the created algorithm with even more rules and situations, which will allow optimization and minimization as well as simple comparison, and in that form will broaden the usage of the created prototype. Different technologies, both in implementation and in support, will also give this concept a new perspective.

When completed, this prototype will be benchmarked in all areas against existing algorithms and tools to see how competitive it really is. At this stage we can conclude that a functional prototype based on the created concepts has given quite satisfying results, which gives us new motivation to continue our work.

5. Complexity of the given solution

The procedure for code comparison has to be very efficient because, in a real situation, a teacher is faced with a few hundred codes that have to be compared pairwise.

The proposed procedure is as efficient as it can be. Although we presented it as a series of filter steps, all of these filters can be implemented in one single pass. After that, another reading pass is needed for the comparison of the filtered codes. That gives two reading passes over the codes, with operations that can be executed in constant time. So, two program codes can be compared in O(n) time, where n is the length of the compared codes in program lines.
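The driver that produces the overall complexity can be sketched as follows (our illustration; the names filtered() and CompareCodes are hypothetical):

    ' Compare every pair of filtered codes exactly once:
    ' N(N-1)/2 calls, each O(n) in the number of program lines.
    Sub CompareAll(filtered() As String, N As Long)
        Dim i As Long, j As Long
        For i = 1 To N - 1
            For j = i + 1 To N
                CompareCodes filtered(i), filtered(j)
            Next j
        Next i
    End Sub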

So if we have N programs, we have to make N(N-1)/2 comparisons of the code. The final complexity of the comparison procedure for N programs of n lines each is therefore O(nN^2).

6. Comparison optimization

In the presented solution there is one issue that requires optimization in order to minimize resource consumption. Looking at the whole comparison process, after the filtering of all codes it is necessary to compare every program code with all the others, so the codes are compared two by two in a number of steps. This requires that in each comparison cycle every filtered code is read again and again, which can be a very demanding process when there is a larger number of codes to compare.

In order to solve this and optimize the comparison process, an additional step has to be taken: the use of a suitable checksum function. This function would be applied after the filtering of every program code and would generate a checksum number for each code. In one reading of all the codes, the checksums would be created, and these would then be compared to detect similar codes, rather than reading and comparing all the codes over and over. This would save resources of every kind.

To do this efficiently, there is one special requirement for the created checksums. Classical checksum algorithms produce checksums that are completely different for even the smallest difference in the content from which they are computed. For this purpose, the checksums would have to differ only a little for little differences in the program codes, in order to efficiently determine the degree of similarity and detect plagiarism attempts. Designing, implementing, and testing this kind of solution will be one of the next steps in our research.
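A checksum with this property is still an open design question. Purely as an illustration of the idea (our sketch, not the paper's method; the names LineHash and FuzzyChecksum are hypothetical), one naive possibility is to sum simple per-line hashes, so that changing a single line changes the total only by the difference of two line hashes:

    ' Naive similarity-friendly checksum: the sum of per-line hashes.
    ' Changing one line changes the total only by the difference of
    ' two line hashes; a classical checksum would change completely.
    Function LineHash(s As String) As Long
        Dim i As Long
        For i = 1 To Len(s)
            LineHash = (LineHash + Asc(Mid$(s, i, 1)) * i) Mod 65521
        Next i
    End Function

    Function FuzzyChecksum(lines() As String) As Double
        Dim i As Long
        For i = LBound(lines) To UBound(lines)
            FuzzyChecksum = FuzzyChecksum + LineHash(lines(i))
        Next i
    End Function

Two codes with close checksums would then be read and compared in full, while pairs with very distant checksums could be skipped.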

7. Conclusion

Plagiarism of computer code is a serious problem in higher education. There is a need for real programming experts, and the educational process has to be brought up to a level of maximum efficiency. There are algorithms that deal with plagiarism detection, but we have tried to produce a comparator that is suitable for the purposes of higher education, particularly in the field of computer science. The developed procedure and prototype showed very promising results when tested on a number of examples from real students. There is still much room for improvement, and that will be the next step in our research.

It is our intention to develop the process of comparing and detecting code duplication even further, and to make the algorithm even more suitable for code comparison as well as for code optimization and minimization. After enhancement of our prototype, more detailed testing will be conducted, as well as testing of code optimization. Our solution will then be compared against the best existing solutions. We will also consider which technologies are best for the implementation of the prototype, in order to make the process consume as few resources as possible.

8. References

[1] Bruntink, M., Deursen, A., Engelen, R., Tourwe, T., On the Use of Clone Detection for Identifying Crosscutting Concern Code, IEEE Transactions on Software Engineering, Vol. 31, No. 10, 2005.

[2] Crnkovic, I., Larsson, M., Building Reliable Component-Based Software Systems, Artech House, Boston, 2002.

[3] Damerau, F.J., A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964.

[4] Gamma, E., Helm, R., Johnson, R., Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.

[5] Hamming, R. W., Error Detecting and Error Correcting Codes, Bell System Technical Journal 29(2):147-160, 1950.

[6] Jaro, M. A., Advances in record linkage methodology as applied to matching the 1985 census of Tampa, Florida, Journal of the American Statistical Association 84(406):414-420, 1989.


[7] Konstantinidis, S., Computing the Levenshtein distance of a regular language, Proceedings of the 2005 IEEE Information Theory Workshop, 2005.

[8] Kupiec, J., Robust part-of-speech tagging using a hidden Markov model, Computer Speech and Language, 6(3):225-242, 1992.

[9] Levenshtein, V. I., Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10:707-710, 1966. (Russian original: Doklady Akademii Nauk SSSR 163(4):845-848, 1965.)

[10] Lowrance, R., Wagner, R. An Extension of the String-to-String Correction Problem, JACM, 1975.

[11] Lyon, C., Barrett, R., Malcolm, J., Plagiarism Is Easy, But Also Easy To Detect, Plagiary: Cross-Disciplinary Studies in Plagiarism, Fabrication, and Falsification, 1 (5): 1-10, 2006.

[12] Navarro, G., A guided tour to approximate string matching, ACM Computing Surveys, 33(1):31–88, 2001.

[13] Ristad, E. S., Yianilos, P. N., Learning string-edit distance, IEEE Transactions on Pattern Analysis and Machine Intelligence 20:522-532, 1998.

[14] Villacís, J. E., The Component Architecture Toolkit, Indiana University, Department of Computer Science, 1999.
