Publication | Closed Access
CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl
Citations: 19
References: 18
Year: 2021
Unknown Venue
Search Engine Optimization, Engineering, Data Deduplication, Semantic Web, Information Retrieval, Data Science, Data Mining, Data Integration, Data Management, Search Technology, Knowledge Discovery, Web Crawls, Common Crawl, Computer Science, Content Similarity Detection, Near-duplicate Documents, Search Engine Indexing, Software Library, Similarity Search, Distributed Search Engine
The number of near-duplicates in web crawls like the ClueWeb or the Common Crawl forces their users either to develop a preprocessing pipeline for deduplication, which is costly both computationally and in person hours, or to accept the undesired effects that near-duplicates have on the reliability and validity of experiments. We introduce ChatNoir-CopyCat-21, which simplifies deduplication significantly. It comes in two parts: (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, as well as between selections of these crawls, and (2) a software library that implements the deduplication of arbitrary document sets. Our analysis shows that 14–52% of the documents within a crawl and around 0.7–2.5% between the crawls are near-duplicates. Two showcases demonstrate the application and usefulness of our resource.
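To make the notion of deduplicating an arbitrary document set concrete, the following is a minimal, generic sketch of near-duplicate detection based on word shingles and Jaccard similarity. It is not the ChatNoir-CopyCat-21 library or its API: the function names (`shingles`, `jaccard`, `near_duplicate_pairs`), the shingle size of 3, and the threshold of 0.8 are all illustrative assumptions, and the abstract does not state which similarity measure the resource uses.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a lowercased text.

    The shingle size n=3 is an illustrative assumption.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Texts shorter than n words yield a single short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.8, n=3):
    """Return sorted ID pairs whose shingle sets reach the threshold.

    docs maps a document ID to its text. The threshold of 0.8 is a
    hypothetical value chosen for this sketch, not one taken from the paper.
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]
```

Comparing all pairs is quadratic in the number of documents, so a sketch like this only suits small sets; at the scale of a web crawl, documents are typically reduced to compact fingerprints first so that candidate pairs can be found without exhaustive comparison.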