Publication | Open Access
CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases
Year: 2019. Citations: 1K. References: 22.
Databases, Biostratigraphy, Bioinformatics Database, Social Sciences, Phylogenetic Analysis, Paleoenvironmental Reconstruction, Standardized Cleaning, Biogeography, Rasterized Coordinates, Biostatistics, Biorepository, Palaeo-environmental Reconstruction, Species Occurrence Records, Biodiversity, Biological Database, Problematic Coordinate Records, Geography, Data Cleansing, Bioinformatics, Biology, Natural Sciences, Microbiology, Paleoecology
Species occurrence records from online databases are indispensable for ecological, biogeographical, and palaeontological research, yet data quality problems such as incorrect geo-referencing or dating diminish their usefulness. The authors introduce CoordinateCleaner, an R package designed to scan large species-occurrence datasets for geo-referencing and dating imprecisions and data-entry errors in a standardized, reproducible manner. The package provides functions that flag potentially problematic coordinates using geographical gazetteers, a database of 9,691 geo-referenced biodiversity institutions for identifying records likely from horticulture or captivity, algorithms that identify datasets with rasterized data, conversion errors, and strong decimal rounding, and spatio-temporal tests for fossils. The authors demonstrate these tools on more than 90 million GBIF occurrences of flowering plants and 19,000 PBDB fossil occurrences. In GBIF, more than 3.4 million records (3.7%) were flagged as potentially problematic and 179 of the tested contributing datasets (18.5%) might be biased by rasterized coordinates; in PBDB, 1,205 records (6.3%) were flagged as potentially problematic. All cleaning functions and the institution database are open-source and available within the package.
Abstract

Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo-referencing or dating, can diminish their usefulness. Manual cleaning is time-consuming, error prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records. Here, we present CoordinateCleaner, an R package to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo-referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio-temporal tests for fossils. We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing datasets (18.5%) might be biased by rasterized coordinates. In PBDB, 1,205 records (6.3%) are potentially problematic. All cleaning functions and the biodiversity institution database are open-source and available within the CoordinateCleaner R package.
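To make the idea of record-level flagging concrete, the following is a minimal Python sketch of the kind of checks the abstract describes: invalid or suspicious coordinates, and proximity to a biodiversity institution. It is not the CoordinateCleaner API, which is an R package. The function names `haversine_km` and `flag_record`, the flag labels, the 1 km radius, and the institution coordinates are all assumptions made for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def flag_record(lat, lon, institutions, radius_km=1.0):
    """Return a list of flag labels for one occurrence record (hypothetical labels).

    institutions: iterable of (lat, lon) pairs for known institutions (illustrative data).
    radius_km: assumed buffer radius around each institution.
    """
    # Coordinates outside the valid range cannot be tested further.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ["invalid"]
    flags = []
    # Exactly (0, 0) often indicates a failed or missing geo-reference.
    if lat == 0 and lon == 0:
        flags.append("zeros")
    # Identical latitude and longitude can indicate a data-entry error.
    if lat == lon:
        flags.append("equal")
    # Records very close to an institution may stem from horticulture or captivity.
    if any(haversine_km(lat, lon, ilat, ilon) <= radius_km for ilat, ilon in institutions):
        flags.append("institution")
    return flags

# Illustrative institution location (made up for this example).
institutions = [(10.0, 20.0)]
print(flag_record(10.001, 20.001, institutions))
print(flag_record(0.0, 0.0, institutions))
print(flag_record(12.5, 48.2, institutions))
```

A record that returns an empty list passed all three checks. As in the abstract, a flag marks a record as potentially problematic and worth inspecting; it does not prove the record is wrong.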