Publication | Closed Access
Efficient Large Scale NLP Feature Engineering with Apache Spark
Year: 2022
Citations: 25
References: 34
Engineering, Knowledge Extraction, Apache Spark, Semantic Web, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Language Engineering, Named-entity Recognition, RDD API, Feature Engineering, English Wikipedia, NLP Task, Knowledge Discovery, Computer Science, Information Extraction, Keyword Extraction, Arts, Massive Data Processing, Big Data
Feature engineering is a computationally expensive and time-consuming step in the end-to-end machine learning pipeline. Large amounts of text data are generated by many heterogeneous sources and platforms on the internet, and the compute resources needed to extract valuable features from these large datasets are growing significantly. In this research, we evaluate the runtime of the RDD and Spark SQL APIs of the Apache Spark framework when extracting text features from the English Wikipedia corpus. Our results show that the Spark SQL API achieves significantly better runtime performance than the RDD API.