Learning to Confuse: Generating Training Time Adversarial Data with Auto-Encoder

Abstract

In this work, we consider a challenging training-time attack that modifies training data with bounded perturbations, aiming to manipulate the test-time behavior (either targeted or non-targeted) of any classifier trained on that data when it is given clean samples. To achieve this, we propose to use an auto-encoder-like network to generate the perturbation on the training data, paired with a differentiable system acting as the imaginary victim classifier. The perturbation generator learns to update its weights by watching the training procedure of the imaginary classifier, in order to produce the most harmful yet imperceptible noise, which in turn leads to the lowest generalization power for the victim classifier. This can be formulated as a non-linear equality-constrained optimization problem. Unlike training GANs, solving such a problem is computationally challenging; we therefore propose a simple yet effective procedure that decouples the alternating updates of the two networks for stability. The proposed method can be easily extended to the label-specific setting, where the attacker manipulates the predictions of the victim classifier according to predefined rules rather than merely causing wrong predictions. Experiments on various datasets, including CIFAR-10 and a reduced version of ImageNet, confirm the effectiveness of the proposed method, and empirical results show that, on image data, such bounded perturbations transfer well regardless of which classifier the victim actually uses.
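
The equality-constrained problem mentioned above can be sketched as a bilevel program. The notation below is an assumption introduced for illustration and does not appear in the abstract: $g_\xi$ denotes the perturbation generator with weights $\xi$, $f_\theta$ the imaginary victim classifier with weights $\theta$, $\mathcal{L}$ a classification loss, $\mathcal{D}$ the clean data distribution, and $\epsilon$ the perturbation budget, here assumed to be measured in the $\ell_\infty$ norm.

```latex
\[
\begin{aligned}
\max_{\xi}\quad
  & \mathbb{E}_{(x,y)\sim\mathcal{D}}
    \big[\mathcal{L}\big(f_{\theta^{*}(\xi)}(x),\, y\big)\big] \\
\text{s.t.}\quad
  & \theta^{*}(\xi) = \arg\min_{\theta}\;
    \mathbb{E}_{(x,y)\sim\mathcal{D}}
    \big[\mathcal{L}\big(f_{\theta}(g_{\xi}(x)),\, y\big)\big], \\
  & \|g_{\xi}(x) - x\|_{\infty} \le \epsilon
    \quad \text{for all } x .
\end{aligned}
\]
```

In this reading, the equality constraint ties the classifier weights to the generator: the inner problem trains the classifier on perturbed samples, while the outer objective measures its loss on clean samples. For the label-specific setting, the outer objective would presumably be replaced by a loss that rewards predictions matching the attacker's predefined label mapping.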