Publication | Open Access
On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
| Citations | References | Year |
|---|---|---|
| 187 | 1 | 2018 |
Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
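The abstract describes discretizing the unknown measure into particles and running gradient descent on their weights and positions, as in training a wide one-hidden-layer network. The sketch below (not the paper's code) illustrates this particle scheme on synthetic data with a ReLU network in the mean-field parameterization; the data, hyperparameters, and the factor-of-`m` time scaling of the updates are illustrative assumptions.

```python
# Minimal sketch, assuming a one-hidden-layer ReLU network
# f(x) = (1/m) * sum_j w_j * relu(theta_j . x), trained by gradient descent
# on both the weights w_j and the positions theta_j of m particles.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration).
d, n, m = 5, 200, 500                       # input dim, samples, particles
X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d))         # arbitrary smooth target

# Over-parameterized initialization: many particles spread out in parameter space.
theta = rng.normal(size=(m, d))             # particle positions
w = rng.normal(size=m)                      # particle weights

relu = lambda z: np.maximum(z, 0.0)

lr = 0.5
for step in range(3000):
    pre = X @ theta.T                       # (n, m) pre-activations
    act = relu(pre)                         # (n, m) activations
    pred = act @ w / m                      # mean-field average over particles
    resid = pred - y

    # Gradients of the quadratic loss L = (1/(2n)) * sum_i resid_i^2.
    grad_w = act.T @ resid / (n * m)                             # (m,)
    grad_theta = ((pre > 0) * resid[:, None] * w).T @ X / (n * m)  # (m, d)

    # Each particle carries mass 1/m, so the particle flow moves it with
    # m times the raw parameter gradient (mean-field time scaling, assumed here).
    w -= lr * m * grad_w
    theta -= lr * m * grad_theta

print("final training loss:", 0.5 * np.mean((relu(X @ theta.T) @ w / m - y) ** 2))
```

With many particles and a spread-out initialization, this discretized flow is the finite-`m` analogue of the Wasserstein gradient flow whose global convergence the paper analyzes.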