Towards better understanding of gradient-based attribution methods for Deep Neural Networks

Abstract

Understanding the flow of information in Deep Neural Networks (DNNs) is a challenging problem that has gained increasing attention over the last few years. While several methods have been proposed to explain network predictions, there have been only a few attempts to compare them from a theoretical perspective. What is more, no exhaustive empirical comparison has been performed in the past. In this work, we analyze four gradient-based attribution methods and formally prove conditions of equivalence and approximation between them. By reformulating two of these methods, we construct a unified framework which enables a direct comparison, as well as an easier implementation. Finally, we propose a novel evaluation metric, called Sensitivity-n, and test the gradient-based attribution methods alongside a simple perturbation-based attribution method on several datasets in the domains of image and text classification, using various network architectures.