
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

175 Citations · 0 References · Year 2018

Abstract

In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
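The norm-preserving initialization the abstract refers to can be illustrated with a particularly simple instance: a "delta-orthogonal" kernel that places a random orthogonal matrix at the kernel's spatial center and zeros at every other tap, so the convolution initially acts as an orthogonal channel mixing. The NumPy sketch below is only illustrative; the function name, the (height, width, in, out) shape convention, and the Haar-sampling details are my assumptions, not the authors' code, and the paper's general algorithm for random orthogonal convolutions is more involved.

import numpy as np

def delta_orthogonal_kernel(k, c_in, c_out, rng=None):
    """Sketch of a delta-orthogonal conv kernel of shape (k, k, c_in, c_out):
    the center tap is a slice of a Haar-random orthogonal matrix, all other
    taps are zero, so the convolution is norm-preserving (needs c_out >= c_in)."""
    if c_out < c_in:
        raise ValueError("norm preservation requires c_out >= c_in")
    rng = np.random.default_rng() if rng is None else rng
    # Haar-random orthogonal matrix via QR of a Gaussian matrix, with the
    # usual sign correction from the diagonal of R.
    a = rng.standard_normal((c_out, c_out))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))
    kernel = np.zeros((k, k, c_in, c_out))
    # Center tap: (c_in, c_out) matrix with orthonormal rows.
    kernel[k // 2, k // 2] = q[:, :c_in].T
    return kernel

# Quick check: the center tap W satisfies W @ W.T = I up to float error,
# which is the norm-preserving condition from the abstract.
W = delta_orthogonal_kernel(3, 64, 128)[1, 1]
assert np.allclose(W @ W.T, np.eye(64))

Because every off-center tap is zero, the initial network computes only orthogonal pointwise mixing, which keeps the singular values of the input-output Jacobian clustered near one; the spatial taps then become nonzero during training.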