Publication | Open Access
Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets
119 Citations · 0 References · Year: 2019
Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which renders the standard back-propagation or chain rule inapplicable. An empirical workaround is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide a theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (which is not available for training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which we verify with CIFAR-10 experiments.
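The STE idea described above can be sketched in a few lines: the forward pass applies the piecewise constant binarized activation (whose true derivative is zero almost everywhere), while the backward pass substitutes the derivative of a chosen proxy function, yielding the coarse gradient. The following is a minimal NumPy illustration; the proxy names and function signatures are ours for exposition, not from the paper's code.

```python
import numpy as np

def binarize(x):
    # Forward pass: binarized ReLU, outputs 1 if x > 0 else 0.
    # Piecewise constant, so its true derivative is 0 almost everywhere.
    return (x > 0).astype(float)

def ste_backward(x, grad_out, proxy="clipped_relu"):
    # Backward pass: instead of the a.e.-zero true derivative, use the
    # derivative of a surrogate (the STE). Which surrogate is a good
    # choice is exactly what the paper analyzes; these three are
    # common illustrative options.
    if proxy == "identity":          # plain pass-through estimator
        d = np.ones_like(x)
    elif proxy == "relu":            # derivative of max(x, 0)
        d = (x > 0).astype(float)
    elif proxy == "clipped_relu":    # derivative of min(max(x, 0), 1)
        d = ((x > 0) & (x < 1)).astype(float)
    else:
        raise ValueError(f"unknown proxy: {proxy}")
    # Coarse "gradient" passed to upstream layers via the modified chain rule
    return grad_out * d
```

Note that different proxies give different coarse gradients for the same input: the clipped-ReLU proxy blocks the gradient once the pre-activation exceeds 1, while the identity proxy always passes it through; the paper's analysis distinguishes which of these choices yields a descent direction for the population loss.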