Wu et al. propose a defense against adversarial examples based on the observation that (deep) neural networks learn confidence regions per class. In particular, their work rests on the assumption that neural networks learn separate manifolds for different classes. On these manifolds – according to the assumption – neural networks learn confidence regions where samples can be classified with high confidence. In a “good” model, these confidence regions should be – like the corresponding manifolds – well separated. They argue, theoretically, that adversarial training as employed by Madry et al. helps to learn “good” models – i.e., the probability of finding a high-confidence adversarial example decreases. Taking confidence information into account, they propose a simple defense strategy: given a trained model (e.g., through adversarial training) and a test sample, classify it according to the class of the most confident neighbor. This involves searching the neighborhood for confident examples; in practice, they employ a strategy similar to the Carlini-Wagner attack. In experiments, they show that this defense strategy can significantly reduce the impact of adversarial attacks. Additionally, this work further highlights the “manifold interpretation” of adversarial examples, i.e., that the data manifold plays an important role when considering adversarial examples.
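To make the idea concrete, here is a minimal sketch of the confident-neighbor classification rule. All names are hypothetical, the “model” is a toy softmax over distances to class centers, and the neighborhood search uses simple random sampling rather than the gradient-based (Carlini-Wagner-style) search the paper actually employs:

```python
import numpy as np

def confidences(x, centers):
    # Toy "model": softmax over negative squared distances to class centers.
    logits = -np.sum((centers - x) ** 2, axis=1)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def confident_neighbor_classify(x, centers, eps=0.5, trials=200, seed=0):
    """Classify x by the label of the most confident point found in its
    eps-neighborhood (random search stands in for the paper's attack-based
    search)."""
    rng = np.random.default_rng(seed)
    probs = confidences(x, centers)
    best_conf, best_label = probs.max(), int(probs.argmax())
    for _ in range(trials):
        delta = rng.uniform(-eps, eps, size=x.shape)
        probs = confidences(x + delta, centers)
        if probs.max() > best_conf:
            best_conf, best_label = probs.max(), int(probs.argmax())
    return best_label, best_conf

centers = np.array([[0.0, 0.0], [4.0, 4.0]])
label, conf = confident_neighbor_classify(np.array([0.5, 0.5]), centers)
```

The intuition is that an adversarial example sits close to the true class's confidence region, so searching the neighborhood recovers a more confident – and hopefully correct – prediction.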
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: