MatchDetectReveal: Finding Overlapping and Similar Digital Documents

Abstract

The Internet provides easy access to large collections of semi-structured digital documents. WWW browsers, search engines and cut-and-paste make it tempting to substitute simple compilation from suitable digital resources for one's own creative work. This paper discusses the problem of detecting plagiarism in large collections of semi-structured electronic texts. The project focuses on overlaps in, and similarity between, digital documents and software code. The conceptual architecture of the MatchDetectReveal system is presented along with possible applications. The main component of the system uses string matching algorithms and a suffix tree representation. Issues of both sequential and parallel cluster-based processing are addressed, and implementation and performance are also discussed.
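
To illustrate the kind of overlap detection that a suffix-based index makes possible, the following is a minimal sketch and not the MatchDetectReveal implementation. It assumes whitespace tokenization and swaps in a naive, quadratic-space suffix trie in place of a true suffix tree; the function names `build_suffix_trie` and `longest_overlap` and the two sample sentences are hypothetical.

```python
def build_suffix_trie(tokens):
    """Index one document by inserting every suffix of its token list.

    Assumption: a naive suffix trie (nested dicts, O(n^2) space) stands in
    for the compact suffix tree mentioned in the abstract.
    """
    root = {}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node.setdefault(tok, {})
    return root


def longest_overlap(trie, tokens):
    """Return the longest run of tokens that also occurs in the indexed document."""
    best = []
    for start in range(len(tokens)):
        node = trie
        length = 0
        # Walk down the trie for as long as the candidate text keeps matching.
        while start + length < len(tokens) and tokens[start + length] in node:
            node = node[tokens[start + length]]
            length += 1
        if length > len(best):
            best = tokens[start:start + length]
    return best


# Hypothetical sample texts, split on whitespace.
source = "the suffix tree stores every suffix of the text".split()
suspect = "our index stores every suffix of the document".split()

trie = build_suffix_trie(source)
overlap = longest_overlap(trie, suspect)
print(" ".join(overlap))  # stores every suffix of the
```

Once the source document is indexed, each candidate document is matched against it without rescanning the source text. A compact suffix tree gives the same lookups in linear space, which matters for large collections.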
