Concepedia

Publication | Open Access

Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning

Citations: 58

References: 24

Year: 2018

TLDR

Finding parameters that minimise a loss function is central to machine learning. Stochastic gradient descent (SGD) is widely used and achieves state-of-the-art results, yet it typically cannot reach the global minimum, so its empirical success has been mysterious. The study derives a correspondence between parameter inference in machine learning and free-energy minimisation in statistical physics, with the degree of undersampling playing the role of temperature; analogous to the energy–entropy competition in physics, wide but shallow minima can be optimal when the system is undersampled. The authors further show that SGD's stochasticity has a non-trivial correlation structure that systematically biases it toward wide minima, and illustrate the framework on deep-learning image classification and on a linear neural network in which entropy is analytically linked to out-of-sample error.
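
The "non-trivial correlation structure" refers to minibatch gradient noise. In standard notation (a sketch for illustration; the symbols below are not quoted from the paper), an SGD step on a minibatch B of size |B| can be written as

\[
\theta_{t+1} = \theta_t - \eta\,\nabla L_B(\theta_t), \qquad
\nabla L_B(\theta_t) = \nabla L(\theta_t) + \xi_t, \qquad
\mathrm{Cov}(\xi_t) \approx \frac{1}{|B|}\, C(\theta_t),
\]

where C(\theta) is the covariance of per-sample gradients. Because C(\theta) depends on the parameters rather than being an isotropic constant, the noise is not simple thermal noise; heuristically it is larger along sharp directions of the loss, which is the sense in which the stochasticity biases SGD toward wide minima.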

Abstract

Finding parameters that minimise a loss function is at the core of many machine learning methods. The Stochastic Gradient Descent (SGD) algorithm is widely used and delivers state-of-the-art results for many problems. Nonetheless, SGD typically cannot find the global minimum, thus its empirical effectiveness is hitherto mysterious. We derive a correspondence between parameter inference and free energy minimisation in statistical physics. The degree of undersampling plays the role of temperature. Analogous to the energy–entropy competition in statistical physics, wide but shallow minima can be optimal if the system is undersampled, as is typical in many applications. Moreover, we show that the stochasticity in the algorithm has a non-trivial correlation structure which systematically biases it towards wide minima. We illustrate our argument with two prototypical models: image classification using deep learning and a linear neural network where we can analytically reveal the relationship between entropy and out-of-sample error.
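
The claimed correspondence can be sketched in one line (standard Bayesian / statistical-mechanics notation, chosen here for illustration rather than quoted from the paper). For N training samples with average loss \bar{L}(\theta), the posterior over parameters behaves like a Boltzmann distribution,

\[
p(\theta \mid \mathcal{D}) \;\propto\; \exp\!\big(-N\,\bar{L}(\theta)\big) \;=\; \exp\!\big(-E(\theta)/T\big), \qquad T \sim 1/N,
\]

so the posterior mass captured by a basin of minima is governed by a free energy

\[
F = \langle E \rangle - T S,
\]

with S the entropy (log-volume) of the basin. When the data are undersampled (small N, hence high effective temperature), the entropy term can dominate, and a wide but shallow basin can have lower free energy than a narrow, deep one.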
