Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks—A Case Study on Genome Gap-Filling

Abstract

Missing data imputation has been an active research topic over the past decade, and many works have proposed novel solutions that have been applied in a variety of fields. Over the same period, the success of deep learning techniques has opened the way to their application to difficult problems where human skill cannot provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is, high-dimensional data belonging to datasets with huge cardinality and describing complex problems. Specifically, they often require critical parameters to be set manually, or they rely on complex architectures and/or training phases that make their computational load impractical. In this paper, after grouping the state-of-the-art imputation techniques into three broad categories, we briefly review the most representative methods and then describe our data imputation proposals, which exploit deep learning techniques specifically designed to handle complex data. Comparative tests show that our deep learning imputers outperform the state-of-the-art KNN-imputation method when filling gaps in human genome sequences.
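
To make the KNN-imputation baseline mentioned above concrete, the following is a minimal sketch of nearest-neighbour gap filling on short sequence windows. It is an illustration only, not the method evaluated in the paper: the `N` gap symbol, the Hamming distance restricted to known positions, the majority vote, and the names `hamming_on_known` and `knn_fill` are all assumptions introduced here.

```python
from collections import Counter

GAP = "N"  # assumed placeholder for an unknown base


def hamming_on_known(a, b):
    """Count mismatches at positions where both windows have a known base."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP and x != y)


def knn_fill(window, references, k=3):
    """Fill gaps in `window` by majority vote among its k nearest references."""
    # Rank reference windows by distance to the incomplete window.
    neighbours = sorted(references, key=lambda r: hamming_on_known(window, r))[:k]
    filled = []
    for i, base in enumerate(window):
        if base != GAP:
            filled.append(base)  # known bases are kept unchanged
            continue
        # Vote among the neighbours that have a known base at this position.
        votes = Counter(r[i] for r in neighbours if r[i] != GAP)
        filled.append(votes.most_common(1)[0][0] if votes else GAP)
    return "".join(filled)


# Hypothetical toy data: equal-length windows, one gap to fill.
references = ["ACGTAC", "ACGTTC", "ACGAAC", "TTTTTT"]
result = knn_fill("ACGNAC", references, k=3)
print(result)
```

A sketch like this also hints at why the abstract calls such parameters critical: the choice of `k` and of the distance function must be made by hand, and the neighbour search scans every reference window, which becomes costly on datasets with huge cardinality.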
