Beilun Wang, Ji Gao, Yanjun Qi. A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Samples. ICLR (Workshop) 2017.

Wang et al. discuss an alternative definition of adversarial examples, taking into account an oracle classifier. Adversarial perturbations are usually constrained in their norm (e.g., $L_\infty$ norm for images); however, the main goal of this constraint is to ensure label invariance – if the image didn’t change notable, the label didn’t change either. As alternative formulation, the authors consider an oracle for the task, e.g., humans for image classification tasks. Then, an adversarial example is defined as a slightly perturbed input, whose predicted label changes, but where the true label (i.e., the oracle’s label) does not change. Additionally, the perturbation can be constrained in some norm; specifically, the perturbation can be constrained on the true manifold of the data, as represented by the oracle classifier. Based on this notion of adversarial examples, Wang et al. argue that deep neural networks are not robust as they utilize over-complete feature representations.

Also find this summary on ShortScience.org.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.