Concepedia

Publication | Open Access

Explaining Explanations in AI

Citations: 696 | References: 89 | Year: 2019

TLDR

Interpretability research has largely produced simplified surrogate models that approximate the decision logic of complex AI systems, serving as pedagogical tools for predicting decisions and diagnosing failures; Box's maxim nonetheless reminds us that all models are wrong but some are useful. The authors distinguish these surrogate models from explanations as understood in philosophy and sociology, describing surrogates as do-it-yourself kits that let practitioners answer counterfactual or contrastive questions without external assistance. They find that although surrogate models offer valuable explanatory power, delivering them as explanations is more difficult than necessary, and other forms of explanation may not carry the same trade-offs. Contrasting schools of thought on what constitutes an explanation, they suggest that machine learning would benefit from viewing the problem more broadly.

Abstract

Recent work on interpretability in machine learning and AI has focused on the building of simplified models that approximate the true criteria used to make decisions. These models are a useful pedagogical device for teaching trained professionals how to predict what decisions will be made by the complex system, and most importantly how the system might break. However, when considering any such model, it is important to remember Box's maxim that "All models are wrong but some are useful." We focus on the distinction between these models and explanations in philosophy and sociology. These models can be understood as a "do it yourself kit" for explanations, allowing a practitioner to directly answer "what if" questions or generate contrastive explanations without external assistance. Although a valuable ability, giving these models as explanations appears more difficult than necessary, and other forms of explanation may not have the same trade-offs. We contrast the different schools of thought on what makes an explanation, and suggest that machine learning might benefit from viewing the problem more broadly.
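The surrogate-model idea the abstract describes can be made concrete with a small sketch. Below, a hypothetical opaque credit-approval rule stands in for the complex system (the paper names no specific model), and a one-feature decision stump is fitted to mimic its outputs; the stump's agreement rate with the black box is its fidelity. The surrogate then acts as the "do it yourself kit": it answers "what if" questions directly, without consulting the original system.

```python
import random

# Hypothetical opaque model: an illustrative stand-in for a complex system.
def black_box(income, debt):
    return income - 0.1 * debt > 50.0

# Sample the black box's behaviour on random inputs.
random.seed(0)
samples = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(2000)]
labels = [black_box(x, d) for x, d in samples]

# Fit a one-feature decision stump as a global surrogate:
# pick the income threshold whose rule "approve if income > t"
# agrees most often with the black box's decisions.
def fit_stump(samples, labels):
    best_t, best_fidelity = 0, -1.0
    for t in range(0, 101):
        agree = sum((x > t) == y for (x, _), y in zip(samples, labels))
        fidelity = agree / len(samples)
        if fidelity > best_fidelity:
            best_t, best_fidelity = t, fidelity
    return best_t, best_fidelity

threshold, fidelity = fit_stump(samples, labels)

# The surrogate answers "what if" / contrastive questions on its own.
def surrogate(income):
    return income > threshold

print(f"surrogate rule: approve if income > {threshold}")
print(f"fidelity to black box: {fidelity:.2f}")
print(f"what if income were 30? -> {surrogate(30)}")
print(f"what if income were 90? -> {surrogate(90)}")
```

Note the trade-off the paper highlights: the stump is easy to read and to interrogate, but its fidelity is below 1.0, so it is "wrong but useful" in exactly Box's sense.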

