Fooling LIME and SHAP

TLDR

Machine‑learning black boxes are increasingly used in high‑stakes domains such as healthcare and criminal justice, prompting a need for interpretable explanations to diagnose systematic errors and biases. This study demonstrates that post‑hoc explanation methods that rely on input perturbations, like LIME and SHAP, are unreliable. The authors introduce a scaffolding technique that lets an adversary craft arbitrary explanations, masking a classifier’s bias while preserving its biased predictions. Evaluation on multiple real‑world datasets, including COMPAS, shows that highly biased classifiers can easily fool LIME and SHAP into producing innocuous explanations that conceal their underlying racism.

Abstract

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
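
The scaffolding idea described above can be illustrated with a minimal sketch. It rests on an assumption the abstract does not spell out: that the adversary has some way to tell real inputs from the perturbed inputs an explainer generates. Here that detector is a hand-written rule on toy data, and every name (`biased_model`, `innocuous_model`, `looks_real`, `scaffolded_model`) is hypothetical. It is not the authors' implementation, only one way such a wrapper could behave.

```python
import random

def biased_model(x):
    """Toy biased classifier: the output is just the sensitive feature x[0]."""
    return x[0]

def innocuous_model(x):
    """Toy innocuous classifier: looks only at the unrelated feature x[1]."""
    return 1 if x[1] > 0 else 0

def looks_real(x):
    """Hypothetical detector (assumption). In this toy data every real row has
    x[2] == x[1]; independently perturbed rows almost never do."""
    return x[2] == x[1]

def scaffolded_model(x):
    """Biased on real-looking inputs, innocuous on everything else."""
    return biased_model(x) if looks_real(x) else innocuous_model(x)

def make_real_row(rng):
    """Real rows: binary sensitive feature plus two identical continuous features."""
    value = rng.gauss(0, 1)
    return [rng.randint(0, 1), value, value]

def perturb(rng):
    """Crude stand-in for perturbation sampling: every feature drawn independently."""
    return [rng.randint(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)]

def agreement(model, rows, reference):
    """Fraction of rows on which model(row) equals reference(row)."""
    return sum(model(r) == reference(r) for r in rows) / len(rows)

rng = random.Random(0)
real_rows = [make_real_row(rng) for _ in range(500)]
perturbed_rows = [perturb(rng) for _ in range(500)]

# On real data the scaffold reproduces the biased model.
real_bias = agreement(scaffolded_model, real_rows, biased_model)
# On perturbed data it behaves like the innocuous model.
perturbed_innocuous = agreement(scaffolded_model, perturbed_rows, innocuous_model)
print(real_bias, perturbed_innocuous)
```

An explainer that probes `scaffolded_model` mainly on rows like `perturbed_rows` would see the output follow `x[1]`, even though predictions on `real_rows` follow the sensitive feature `x[0]`.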
