Publication | Open Access
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Citations: 237
References: 0
Year: 2019
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparameterized. This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent works:
(i) Using a tighter characterization of training speed than prior papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17].
(ii) A generalization bound independent of network size, based on a data-dependent complexity measure. Experiments show that this measure clearly distinguishes random labels from true labels on MNIST and CIFAR. Moreover, whereas recent papers require the sample complexity to grow (slowly) with the network size, our sample complexity is completely independent of it.
(iii) Learnability of a broad class of smooth functions by two-layer ReLU networks trained via gradient descent.
The key idea is to track the dynamics of training and generalization via properties of a related kernel.
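A minimal NumPy sketch of one reading of this idea, assuming the "related kernel" is the infinite-width ReLU Gram matrix H^infty and the data-dependent complexity measure is sqrt(2 y^T (H^infty)^{-1} y / n), as in the paper's analysis; the toy inputs, the label choices, and the small ridge term are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def relu_ntk_gram(X):
    """Infinite-width ReLU Gram matrix H^infty for unit-norm rows of X.

    H_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi), the closed form of
    E_w[x_i.x_j * 1{w.x_i >= 0, w.x_j >= 0}] for w ~ N(0, I).
    """
    G = X @ X.T
    G = np.clip(G, -1.0, 1.0)                 # guard arccos against rounding error
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def complexity_measure(X, y, reg=1e-8):
    """Data-dependent measure sqrt(2 * y^T (H^infty)^{-1} y / n)."""
    n = len(y)
    H = relu_ntk_gram(X) + reg * np.eye(n)    # small ridge for numerical stability (assumption)
    alpha = np.linalg.solve(H, y)
    return np.sqrt(2.0 * y @ alpha / n)

# Toy comparison: structured labels vs. random labels on the same inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs, as the analysis assumes
y_true = np.sign(X[:, 0])                      # labels correlated with the data
y_rand = rng.choice([-1.0, 1.0], size=200)     # random labels
print("structured:", complexity_measure(X, y_true))
print("random:    ", complexity_measure(X, y_rand))
```

On data with label structure the measure tends to come out smaller than with random labels, which is the qualitative behavior the abstract describes for MNIST and CIFAR.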