Near-Duplicate Detection in the IDS Corpora of Written German

Abstract

A problem often encountered in the preparation of very large text corpora is the presence of a certain, difficult-to-estimate amount of (partial) copies. In other words, large text collections, whether they come from the World Wide Web or from newspapers’ CMS dumps, usually contain many texts that either do not differ at all or differ only slightly and typically stem from the same act of text production. Such (partial) copies, or, to put it more neutrally and cautiously, (near) duplicates, not only hamper ‘manual’ corpus queries but, more importantly, may also bias statistical analyses in an unpredictable manner. This paper is concerned with the first step required to deal with such corpus contamination. It presents an algorithm that detects (near) duplicates in large text collections by efficiently computing complete similarity matrices, which can serve as a sound basis for the later identification of unwanted (partial) copies. It further introduces some basic concepts and techniques, compares two different similarity metrics, describes the application of the algorithm to the IDS corpora of written German, and comments on its computational complexity and scalability.
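
To make the notion of a complete similarity matrix concrete, the following minimal Python sketch computes one by brute force for a handful of texts. It is an illustration under stated assumptions, not the algorithm presented in the paper: the abstract does not name the two metrics it compares, so the Jaccard coefficient and the overlap coefficient over word trigrams are stand-ins chosen here, and all function names are hypothetical. The pairwise loop is quadratic in the number of texts, which is exactly the cost an efficient method has to avoid.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Texts shorter than n words yield a single short shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Shared shingles divided by all shingles found in either text."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Shared shingles divided by the size of the smaller shingle set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def similarity_matrix(texts, metric=jaccard, n=3):
    """Compute the complete symmetric similarity matrix by brute force."""
    sets = [shingles(t, n) for t in texts]
    size = len(sets)
    # The diagonal is set to 1.0 by convention (each text equals itself).
    matrix = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for i, j in combinations(range(size), 2):
        matrix[i][j] = matrix[j][i] = metric(sets[i], sets[j])
    return matrix


docs = [
    "the cat sat on the mat today",
    "the cat sat on the mat",
    "completely different words appear in here",
]
jaccard_matrix = similarity_matrix(docs)
overlap_matrix = similarity_matrix(docs, metric=overlap)
```

The two stand-in metrics behave differently on partial copies: when one text is fully contained in a longer one, the overlap coefficient reaches its maximum, whereas the Jaccard coefficient is lowered by the material the shorter text lacks.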
