Efficient Large Scale NLP Feature Engineering with Apache Spark

Abstract

Feature engineering is a computationally time-consuming process in the end-to-end machine learning pipeline. Large amounts of text data are being generated on many heterogeneous sources and platforms on the internet. The compute resources needed to extract valuable features from these big datasets are increasing significantly. In this research, we evaluate the runtime of the RDD and the Spark-SQL APIs of the Apache Spark framework to extract text features from the corpus of english Wikipedia. As a result, we demonstrate the significant runtime performance of the SparkSQL compared to RDD API.

References

Page 1

	Year	Citations

Page 1