Publication | Closed Access
A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective
Citations: 841
References: 160
Year: 2019
Artificial Intelligence, Data Annotation, Engineering, Machine Learning, Machine Learning Tool, Intelligent Systems, Reasons Data Collection, Big Data Infrastructure, AI Integration Perspective, Big Data Model, Data Science, Data Mining, Data Collection, Big Data Architecture, Data Integration, Data Management, Feature Learning, Machine Learning Model, Knowledge Discovery, Computer Science, Big Data Search, Data-centric AI, Deep Learning, Big Data Acquisition, Annotation, Big Data
Data collection is a growing bottleneck in machine learning, driven by the rise of new applications lacking labeled data and the larger data demands of deep learning, prompting research across ML, NLP, CV, and data‑management communities. This survey offers a comprehensive data‑management‑centric view of data collection, mapping its operations, proposing usage guidelines, and highlighting open research challenges. The authors analyze data collection as comprising acquisition, labeling, and data/model improvement, and provide a structured research landscape for these activities.
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are two main reasons data collection has recently become a critical issue. First, as machine learning becomes more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community, due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, offer guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.