CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl

Abstract

The number of near-duplicates in web crawls like the ClueWeb or the Common Crawl forces their users either to develop a preprocessing pipeline for deduplication, which is costly both computationally and in person hours, or to accept the undesired effects that near-duplicates have on the reliability and validity of experiments. We introduce ChatNoir-CopyCat-21, which simplifies deduplication significantly. It comes in two parts: (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, as well as between selections of these crawls, and (2) a software library that implements the deduplication of arbitrary document sets. Our analysis shows that 14–52% of the documents within a crawl and around 0.7–2.5% between the crawls are near-duplicates. Two showcases demonstrate the application and usefulness of our resource.
