Arabic document similarity analysis using n-grams and singular value decomposition

Abstract

Computerized methods for document similarity estimation (or plagiarism detection) in natural languages, developed over the last two decades, have focused on English in particular and on some other languages such as German and Chinese. Several language-independent methods also exist, but their accuracy is not satisfactory, especially for morphologically rich and complex languages such as Arabic. This paper proposes an innovative content-based method for document similarity analysis dedicated to the Arabic language, in order to bridge the existing gap in such software solutions. The proposed method is based on modeling the relation between documents and their n-gram phrases. These phrases are generated from the normalized text, exploiting Arabic morphological analysis and lexical lookup. Possible morphological ambiguity is resolved by applying Part-of-Speech (PoS) tagging to the examined documents. Text indexing and stop-word removal are performed using a new method based on morphological analysis of the text. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that takes lexical and syntactic changes into account. Then, the hidden associations between the unique n-gram phrases and their documents are investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. The performance of the proposed method was confirmed through experiments with various data sets, in which it showed promising capabilities in estimating literal similarity and some types of intelligent similarity. Finally, the results of the proposed method were compared with those of Plagiarism-Checker-X, and the proposed method outperformed Plagiarism-Checker-X, especially for the intelligent similarity cases with syntactic changes.
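
The core of the pipeline described above (n-gram phrases, a TF-IDF model, and SVD-based similarity) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain whitespace tokenization in place of Arabic morphological analysis, PoS tagging and stop-word removal, exact n-gram matching in place of the heuristic pairwise matching, a common smoothed IDF formula, and cosine similarity in the rank-k latent space. The function names (`word_ngrams`, `tfidf_matrix`, `lsa_similarity`) and the three toy sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

import numpy as np


def word_ngrams(text, n=2):
    """Split on whitespace and return the word n-grams of a text.

    Assumption: no normalization or morphological analysis is applied here.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build an (n-grams x documents) TF-IDF matrix with a smoothed IDF."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    index = {gram: i for i, gram in enumerate(vocab)}
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    A = np.zeros((len(vocab), len(docs)))
    for j, c in enumerate(counts):
        for gram, tf in c.items():
            idf = math.log((1 + len(docs)) / (1 + df[gram])) + 1.0
            A[index[gram], j] = tf * idf
    return A, vocab


def lsa_similarity(A, k=2):
    """Cosine similarity between documents in the rank-k SVD (LSA) space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Each row holds one document's coordinates in the latent space.
    D = (np.diag(s[:k]) @ Vt[:k]).T
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    D = D / norms
    return D @ D.T


# Toy documents (illustrative only): the first two differ in one word,
# the third shares no bigram with the others.
docs = [
    "ذهب الولد إلى المدرسة صباحا",
    "ذهب الولد إلى المدرسة مساء",
    "القطة تنام فوق السرير",
]
A, vocab = tfidf_matrix(docs, n=2)
sim = lsa_similarity(A, k=2)
print(np.round(sim, 3))
```

In this toy setting the rank-2 truncation places the two near-duplicate sentences on the same latent direction, so their similarity is high, while the unrelated sentence stays close to orthogonal. Handling syntactic changes such as word reordering, as the paper reports, would require the morphological and matching steps that this sketch leaves out.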
