Concepedia

Abstract

With the preponderance of harassment and abuse, social media platforms and online discussion platforms seek to curb toxic comments. Google's Perspective aims to help platforms classify toxic comments. We have created a pipeline to modify toxic comments to evade Perspective. This pipeline uses existing adversarial machine learning attacks to find the optimal perturbation which will evade the model. Since these attacks typically target images, as opposed to discrete text data, we include a process to generate text candidates from perturbed features and select candidates to retain syntactic similarity. We demonstrated that using a model with just 10,000 queries, changing three words in each comment evades Perspective 25% of the time, suggesting that building a surrogate model may not require many queries and a more robust approach is needed to improve the toxic comment classifier accuracy.

References

YearCitations

Page 1