Publication | Closed Access
Random search for hyper-parameter optimization
Citations: 7.9K
References: 22
Year: 2012
Venue: Journal of Machine Learning Research
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets.
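The core idea of the abstract can be sketched in a few lines: sample each hyper-parameter independently at random and keep the best trial. The sketch below is illustrative only (the function and hyper-parameter names are hypothetical, not the paper's experimental code); the toy objective mirrors the paper's finding that usually only a few hyper-parameters really matter.

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Minimal random search: sample each hyper-parameter uniformly
    and independently from its range, keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective (hypothetical): "lr" dominates the score, while
# "momentum" barely matters -- the low-effective-dimensionality
# situation in which random search beats a grid.
def objective(p):
    return -(p["lr"] - 0.1) ** 2 - 0.01 * (p["momentum"] - 0.9) ** 2

space = {"lr": (0.0, 1.0), "momentum": (0.0, 1.0)}
params, score = random_search(objective, space, n_trials=64)
```

Because every trial is independent, the search trivially parallelizes and the budget can be chosen freely, unlike a grid whose size is fixed by the resolution per axis.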
| Year | Citations |
|---|---|
| 1998 | 56.5K |
| 1983 | 44K |
| 2011 | 41.1K |
| 1965 | 28.5K |
| 1994 | 18.7K |
| 2006 | 16.2K |
| 2010 | 12.6K |
| 2000 | 7.5K |
| 2008 | 7.2K |
| 2003 | 2.5K |