Interpretation of Neural Networks Is Fragile

TLDR

Reliable explanations are essential for trusting machine learning, so it matters how far neural-network interpretations can be altered by small systematic perturbations to the input. This study demonstrates how to craft perceptually indistinguishable adversarial perturbations that preserve the predicted label while producing markedly different interpretations. We systematically evaluate the robustness of several widely used feature-importance methods (feature importance maps, integrated gradients, and DeepLIFT) on ImageNet and CIFAR-10, and show that exemplar-based interpretations such as influence functions are similarly susceptible. In all cases, such perturbations can drastically alter interpretations without changing the label, and an analysis of the geometry of the Hessian matrix gives insight into why robustness is a general challenge for current interpretation approaches.

Abstract

For machine learning to be trusted in many applications, it is critical to be able to reliably explain why a machine learning algorithm makes certain predictions. For this reason, a variety of methods have recently been developed to interpret neural network predictions by providing, for example, feature importance maps. For both scientific robustness and security reasons, it is important to know to what extent the interpretations can be altered by small systematic perturbations to the input data, which might be generated by adversaries or by measurement biases. In this paper, we demonstrate how to generate adversarial perturbations that produce perceptually indistinguishable inputs that are assigned the same predicted label, yet have very different interpretations. We systematically characterize the robustness of interpretations generated by several widely used feature importance interpretation methods (feature importance maps, integrated gradients, and DeepLIFT) on ImageNet and CIFAR-10. In all cases, our experiments show that systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g., influence functions) are similarly susceptible to adversarial attack. Our analysis of the geometry of the Hessian matrix gives insight into why robustness is a general challenge for current interpretation approaches.
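
To make the role of the Hessian concrete, the following is a minimal sketch rather than the attack or models used in the paper. It assumes a toy one-hidden-layer softplus network with random weights, uses the plain input gradient as the feature importance map, and perturbs the input along the Hessian eigenvector with the largest absolute eigenvalue. That direction is an illustrative choice; the function names (`score`, `saliency`, `hessian`, `top_k_overlap`) and all constants are hypothetical. The point it illustrates is the first-order relation: a small perturbation delta shifts the gradient map by approximately H times delta, so a large Hessian lets the interpretation move even when the score barely changes.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W1, w2):
    """Scalar score of a toy one-hidden-layer softplus network (assumed model)."""
    return float(w2 @ softplus(W1 @ x))

def saliency(x, W1, w2):
    """Simple-gradient feature importance: gradient of the score w.r.t. x."""
    return W1.T @ (w2 * sigmoid(W1 @ x))

def hessian(x, W1, w2):
    """Hessian of the score w.r.t. x; it controls how the saliency map moves."""
    s = sigmoid(W1 @ x)
    return W1.T @ np.diag(w2 * s * (1.0 - s)) @ W1

def top_k_overlap(a, b, k):
    """Fraction of the k largest-magnitude features shared by two maps."""
    top_a = set(np.argsort(-np.abs(a))[:k])
    top_b = set(np.argsort(-np.abs(b))[:k])
    return len(top_a & top_b) / k

# Hypothetical toy setup: random weights and a random input.
rng = np.random.default_rng(0)
d, h = 20, 16
W1 = rng.normal(size=(h, d))
w2 = rng.normal(size=h)
x = rng.normal(size=d)

# Illustrative perturbation: the Hessian eigenvector with the largest
# absolute eigenvalue, scaled to a small norm eps.
H = hessian(x, W1, w2)
eigvals, eigvecs = np.linalg.eigh(H)
v = eigvecs[:, np.argmax(np.abs(eigvals))]
eps = 0.05
delta = eps * v

g0 = saliency(x, W1, w2)
g1 = saliency(x + delta, W1, w2)
overlap = top_k_overlap(g0, g1, k=5)

print("score before / after:", score(x, W1, w2), score(x + delta, W1, w2))
print("top-5 overlap of saliency maps:", overlap)
print("first-order error |g1 - (g0 + H delta)|:",
      np.linalg.norm(g1 - (g0 + H @ delta)))
```

On real image classifiers the same quantities would be computed with automatic differentiation rather than closed-form expressions, and the perturbation would additionally be constrained to leave the predicted label unchanged.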
