Semantically Equivalent Adversarial Rules for Debugging NLP Models

TLDR

NLP models are often brittle, producing different predictions for semantically similar inputs. The paper introduces semantically equivalent adversaries (SEAs), semantics-preserving perturbations that change a model's prediction, to detect this behavior automatically for individual instances. SEAs are generalized into semantically equivalent adversarial rules (SEARs): simple, universal replacement rules that induce adversaries on many instances, enabling bug detection in black-box models across three domains (machine comprehension, visual question answering, and sentiment analysis). In user studies, the approach produces high-quality local adversaries for more instances than humans do, and SEARs induce four times as many mistakes as the bugs found by human experts. Retraining with data augmentation significantly reduces these bugs while maintaining accuracy.

Abstract

Complex machine learning models for NLP are often brittle, making different predictions for input instances that are extremely similar semantically. To automatically detect this behavior for individual instances, we present semantically equivalent adversaries (SEAs) – semantic-preserving perturbations that induce changes in the model’s predictions. We generalize these adversaries into semantically equivalent adversarial rules (SEARs) – simple, universal replacement rules that induce adversaries on many instances. We demonstrate the usefulness and flexibility of SEAs and SEARs by detecting bugs in black-box state-of-the-art models for three domains: machine comprehension, visual question-answering, and sentiment analysis. Via user studies, we demonstrate that we generate high-quality local adversaries for more instances than humans, and that SEARs induce four times as many mistakes as the bugs discovered by human experts. SEARs are also actionable: retraining models using data augmentation significantly reduces bugs, while maintaining accuracy.
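To make the idea of a replacement rule concrete, the following is a minimal sketch, not the authors' implementation. It assumes a rule is already given as a (pattern, replacement) pair and that the replacement preserves meaning; the paper's method for generating and filtering candidate paraphrases is not reproduced here. The names `apply_rule`, `find_flips`, and `toy_model` are hypothetical, and `toy_model` is a deliberately brittle stand-in for a black-box classifier.

```python
import re
from typing import Callable, List, Tuple


def apply_rule(text: str, pattern: str, replacement: str) -> str:
    """Apply a whole-word replacement rule such as "movie" -> "film"."""
    regex = r"\b" + re.escape(pattern) + r"\b"
    # A function replacement avoids backslash processing in `replacement`.
    return re.sub(regex, lambda _match: replacement, text)


def find_flips(
    model: Callable[[str], str],
    inputs: List[str],
    pattern: str,
    replacement: str,
) -> List[Tuple[str, str]]:
    """Return (original, perturbed) pairs on which the prediction changes.

    The model is treated as a black box: only its predictions are used.
    """
    flips = []
    for text in inputs:
        perturbed = apply_rule(text, pattern, replacement)
        if perturbed == text:
            continue  # the rule does not apply to this instance
        if model(perturbed) != model(text):
            flips.append((text, perturbed))
    return flips


def toy_model(text: str) -> str:
    """Hypothetical brittle classifier that keys on one surface token."""
    return "positive" if "movie" in text.lower().split() else "negative"


examples = ["This movie is great", "A great film", "The movie was fun"]
flips = find_flips(toy_model, examples, "movie", "film")
for original, perturbed in flips:
    print(f"{original!r} -> {perturbed!r}")
```

A rule that flips predictions on many instances, as this one does on the toy data, is the kind of candidate the abstract describes as a SEAR. In practice each perturbed input would also need to be checked for semantic equivalence, which this sketch simply assumes.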
