Plagiarism is a common problem for educational institutions. Because plagiarized work is difficult to detect manually, there is a need for an efficient automated system that can detect the theft of program code.
For academic purposes, algorithms based on token comparison usually give the best results. However, they can produce false positives when two programs have a similar code structure despite differing in domain and purpose. We propose to introduce weighting coefficients for tokens and to calculate the code similarity percentage using those coefficients. Our experimental study is intended to confirm that this improves the quality of software plagiarism detection.
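The idea of weighting tokens can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the proposed method: the token names, the weight values in `TOKEN_WEIGHTS`, and the use of a weighted multiset overlap (a weighted Jaccard measure) as the similarity formula. Tokens that occur in almost any program receive a low weight, so sharing them contributes little to the final percentage.

```python
from collections import Counter

# Hypothetical weights: common structural tokens count less than tokens
# that are more specific to a particular solution.
TOKEN_WEIGHTS = {
    "ASSIGN": 0.2,
    "RETURN": 0.3,
    "IF": 0.5,
    "LOOP": 0.5,
    "CALL": 1.0,
}
DEFAULT_WEIGHT = 1.0  # assumed weight for tokens not listed above


def weighted_similarity(tokens_a, tokens_b, weights=TOKEN_WEIGHTS):
    """Return a similarity percentage based on weighted token overlap."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    shared = 0.0
    total = 0.0
    for token in count_a.keys() | count_b.keys():
        w = weights.get(token, DEFAULT_WEIGHT)
        # Occurrences present in both programs count as shared weight.
        shared += w * min(count_a[token], count_b[token])
        # The larger count bounds the total weight for this token.
        total += w * max(count_a[token], count_b[token])
    return 100.0 * shared / total if total else 0.0


program_a = ["LOOP", "ASSIGN", "CALL"]
program_b = ["LOOP", "ASSIGN", "RETURN"]
similarity = weighted_similarity(program_a, program_b)
print(f"{similarity:.1f}%")
```

With equal weights these two token lists would share 2 of 4 distinct tokens (50%). Under the assumed weights the shared tokens are the low-weight ones, so the score drops to about 35%, which is the kind of reduction in structural false positives that the weighting is meant to achieve. This sketch ignores token order; a real detector would typically combine such weights with a sequence-based comparison.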