Disguised plagiarism detection in Arabic text documents

Abstract

Plagiarism detection is a challenging Natural Language Processing (NLP) task. Many recent systems can detect simple verbatim reproduction (copy and paste). Real plagiarism cases, however, often involve more disguised techniques, such as rewording, synonym substitution, paraphrasing, and text manipulation, which make detection much more difficult. In this paper, we propose two approaches intended to help users detect plagiarism in Arabic natural language texts. The first approach is based on word embeddings, word alignment, and word weighting, and measures the semantic similarity between textual units. The second approach is based on Machine Learning (ML), with characterisation performed at the sentence level. We combine lexical, syntactic, and semantic features to support the detection task, and we investigate Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF) classifiers. The classifiers are trained and evaluated on the training dataset of the first Arabic Plagiarism Detection shared task (AraPlagDet 2015). Our experimental results show that the proposed approaches achieve promising results compared to state-of-the-art Arabic plagiarism detection systems.
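To make the first approach more concrete, the following is a minimal sketch of one common way to combine word embeddings, word alignment, and word weighting into a sentence similarity score. It is an illustration under stated assumptions, not the method evaluated in the paper: the greedy best-match alignment, the weighted averaging, the function names, the transliterated toy tokens, and the toy vectors and weights are all hypothetical. In practice the vectors would come from a pretrained Arabic embedding model and the weights from a scheme such as IDF.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def directed_similarity(src, tgt, vectors, weights):
    """Align each source word to its most similar target word, then
    return the weight-averaged alignment score."""
    total = 0.0
    weight_sum = 0.0
    for word in src:
        if word not in vectors:
            continue  # out-of-vocabulary words are skipped in this sketch
        best = max(
            (cosine(vectors[word], vectors[t]) for t in tgt if t in vectors),
            default=0.0,
        )
        w = weights.get(word, 1.0)  # default weight is an assumption
        total += w * best
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def sentence_similarity(a, b, vectors, weights):
    """Symmetric score: mean of the two directed similarities."""
    return 0.5 * (
        directed_similarity(a, b, vectors, weights)
        + directed_similarity(b, a, vectors, weights)
    )


# Toy 3-dimensional vectors for transliterated tokens (hypothetical values).
toy_vectors = {
    "qara": [0.0, 1.0, 0.0],      # "read"
    "talaa": [0.1, 0.9, 0.0],     # "perused"
    "kitab": [1.0, 0.0, 0.0],     # "book"
    "mujallad": [0.9, 0.1, 0.0],  # "volume"
    "saqa": [0.0, 0.1, 0.9],      # "drove"
    "sayyara": [0.0, 0.0, 1.0],   # "car"
}
# Toy word weights standing in for IDF-style weights (hypothetical values).
toy_weights = {"qara": 1.0, "talaa": 1.2, "kitab": 1.5,
               "mujallad": 1.8, "saqa": 1.1, "sayyara": 1.4}

original = ["qara", "kitab"]
paraphrase = ["talaa", "mujallad"]
unrelated = ["saqa", "sayyara"]

para_score = sentence_similarity(original, paraphrase, toy_vectors, toy_weights)
unrel_score = sentence_similarity(original, unrelated, toy_vectors, toy_weights)
print(f"paraphrase: {para_score:.3f}, unrelated: {unrel_score:.3f}")
```

With these toy values, the synonym-substituted sentence scores much higher than the unrelated one even though the two share no surface words, which is the property that makes embedding-based alignment useful against disguised plagiarism. A real detector would then apply a threshold or feed the score to a classifier.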
