Publication | Open Access
Textual Data Augmentation for Efficient Active Learning on Tiny Datasets
Citations: 45
References: 38
Year: 2020
Unknown Venue
Artificial Intelligence, Few-shot Learning, Textual Data Augmentation, Structured Prediction, Engineering, Machine Learning, LLM Fine-tuning, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Data Generation Task, Non-guided Data Generation, Semi-supervised Learning, Supervised Learning, Machine Translation, Data Augmentation, Benchmark Datasets, NLP Task, Knowledge Discovery, Computer Science, Deep Learning, Retrieval Augmented Generation, Language Generation Model, Language Generation
In this paper, we propose a novel data augmentation approach in which guided outputs of a language generation model (e.g., GPT-2), once labeled, can improve the performance of text classifiers through an active learning process. We cast the data generation task as an optimization problem that maximizes the usefulness of the generated output, using Monte Carlo Tree Search (MCTS) as the optimization strategy and incorporating entropy as one of the optimization criteria. We test our approach against a Non-Guided Data Generation (NGDG) process that does not optimize for a reward function. Starting with a small set of data, our results show performance increases with MCTS of 26% on the TREC-6 Questions dataset and 10% on the Stanford Sentiment Treebank SST-2 dataset. Compared with NGDG, we achieve increases of 3% on TREC-6 and 5% on SST-2.
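The abstract names entropy as one of the optimization criteria but does not give the reward function. As a rough illustration only, the sketch below scores candidate generated sentences by the Shannon entropy of a classifier's predicted class probabilities and ranks the most uncertain ones first, which is a common way to pick examples worth labeling in active learning. The function names, the sample sentences, and the probability values are all hypothetical, and this is not the authors' MCTS procedure.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability distribution."""
    # Zero-probability classes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rank_by_entropy(candidates):
    """Sort (text, probs) pairs from most to least uncertain."""
    return sorted(candidates, key=lambda c: predictive_entropy(c[1]), reverse=True)


# Hypothetical generated sentences with made-up classifier outputs
# over three classes (assumption: not data from the paper).
candidates = [
    ("What year did the war end?", [0.90, 0.05, 0.05]),
    ("Name the thing near the river", [0.34, 0.33, 0.33]),
    ("Who wrote it and where?", [0.50, 0.45, 0.05]),
]

ranked = rank_by_entropy(candidates)
for text, probs in ranked:
    print(f"{predictive_entropy(probs):.3f}  {text}")
```

In a guided generation setting, a score like this could serve as part of the reward that a search procedure tries to maximize, so that generation is steered toward sentences the current classifier is least sure about.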