Towards a Definition for Adversarial Examples

Obtaining deep networks robust against adversarial examples is a wide open problem. While many papers are devoted to training more robust deep networks, a clear definition of adversarial examples has not been agreed upon. In this article, I want to discuss two very simple toy examples illustrating the necessity of a proper definition of adversarial examples.


The robustness of deep neural networks against so-called adversarial examples, imperceptibly perturbed images that cause mis-classification, has received considerable attention in the last few years, as can be seen in my earlier survey article. However, only a few defenses against adversarial examples have been shown to be effective [][], leaving the problem largely unsolved, even for simple datasets such as MNIST [].

So far, adversarial training is, to the best of my knowledge, the only defense mechanism that has been repeatedly shown to be effective. However, training on adversarial examples often leads to reduced accuracy. This observation has led to some work on the relationship between adversarial robustness and accuracy [][][]. In [], for example, a simple toy dataset is used to show that adversarial robustness and generalization might be conflicting goals.

In this article, I argue that a proper definition of adversarial examples is particularly important when studying adversarial robustness in the context of toy datasets. While small image perturbations, usually bounded through an $L_p$ norm, will rarely change the actual, true label, it is significantly more difficult to ensure label-invariance on simpler toy datasets, especially when not considering images.

The following discussion is largely based on Appendix I of our arXiv pre-print.

Figure 1: Adversarial examples on EMNIST and Fashion-MNIST. Top: original image, middle: adversarial example, bottom: difference between adversarial example and original image (normalized).

Background. Given a data distribution $p(x, y)$ over images $x$ and labels $y$ and a deep neural network $f$ with $f(x) = y$, an adversarial example is a perturbed image $\tilde{x} = x + \delta$ such that $f(\tilde{x}) \neq y$, that is, $\tilde{x}$ is mis-classified. With access to $f$'s parameters and gradients, the adversarial example $\tilde{x}$ can be computed by directly maximizing the cross-entropy loss:

$\max_\delta \mathcal{L}(f(x + \delta), y)$ such that $\|\delta\|_\infty \leq \epsilon$ and $\tilde{x}_i \in [0,1]$.

Note that the $L_\infty$ norm can also be replaced by any other $L_p$ norm. Here, the $\epsilon$-constraint is meant to ensure label invariance: for sufficiently small $\epsilon$, the perturbation is assumed not to change the true label of the image. In practice, projected gradient descent can be used to solve this problem []. Some examples on EMNIST [] and Fashion-MNIST [] are shown in Figure 1. Note that the computed perturbations $\delta = \tilde{x} - x$ exhibit (seemingly) random noise patterns without much structure.
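To make the optimization problem above concrete, here is a minimal sketch of the projected, signed-gradient procedure. It assumes a hypothetical logistic-regression model in place of a deep network, so that the input gradient can be written by hand, and uses labels in $\{0, 1\}$ for the cross-entropy loss. The names `pgd_linf`, `loss_grad_x` and `predict`, the weights, the step size and the iteration count are illustrative choices and are not taken from [].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_x(w, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x, for y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return [(sigmoid(z) - y) * wi for wi in w]

def pgd_linf(w, b, x, y, epsilon, step=0.01, iterations=40):
    """Signed-gradient ascent on the loss with projection after every step."""
    x_adv = list(x)
    for _ in range(iterations):
        grad = loss_grad_x(w, b, x_adv, y)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            xi = x_adv[i] + step * sign
            # Project onto the L_inf ball of radius epsilon around x ...
            xi = min(max(xi, x[i] - epsilon), x[i] + epsilon)
            # ... and onto the valid pixel range [0, 1].
            x_adv[i] = min(max(xi, 0.0), 1.0)
    return x_adv

def predict(w, b, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Hypothetical two-pixel "image" and hand-picked model parameters.
w, b = [2.0, -1.0], -0.5
x, y = [0.4, 0.2], 1
x_adv = pgd_linf(w, b, x, y, epsilon=0.1)
print(predict(w, b, x), predict(w, b, x_adv))
```

In this toy setting the clean input is classified correctly, while the perturbed input stays within the $\epsilon$-ball and crosses the decision boundary. For a deep network, the hand-written gradient would be replaced by automatic differentiation.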

Example and Definition

As shown in Figure 1, the $\epsilon$-constraint is effective in ensuring label-invariance on images. However, for other modalities this strategy might not work as well. For text, for example, adversarial examples are, strictly speaking, not imperceptible anymore. Still, flipping individual characters in long sentences usually does not alter the semantics.

For more abstract toy examples, however, label-invariance needs to be based on the data distribution $p$. For example, consider a binary classification problem with labels $y \in \{-1,1\}$, $p(y = 1) = p(y = -1) = 0.5$, and observations $x$ drawn from:

$p(x = 0|y = 1) = 1$ and $p(x = \epsilon|y = -1) = 1$

In words, the data distribution consists of two point masses for the pairs $(0, 1)$ and $(\epsilon, -1)$. This problem is linearly separable for any $\epsilon > 0$. However, it seems that no classifier can be adversarially robust against an $\epsilon$-constrained adversary. For example, let the observation $x = 0$ with $y = 1$ and the adversarial example $\tilde{x} = x + \epsilon = \epsilon$. This adversarial example would fool any classifier that achieves linear separation, implying that no "good" classifier can be robust on this problem. Note, however, that a constant classifier will be robust, but will not perform better than chance.

In the above example, an $\epsilon$-constraint on the absolute value of the perturbation does not ensure label-invariance as the adversarial example $\tilde{x}$ is, by construction, more likely to have label $y = -1$:

$1 = p(y = -1 | x = \epsilon) > p(y = 1| x = \epsilon) = 0$.
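The point-mass example can be checked in a few lines. This is only an illustrative sketch: the function names and the value $\epsilon = 0.3$ are assumptions made for the demonstration.

```python
def threshold_classifier(x, epsilon):
    """Linear separator for the point masses at 0 (label 1) and epsilon (label -1)."""
    return 1 if x < epsilon / 2 else -1

def true_posterior(x, epsilon):
    """p(y | x) on the support {0, epsilon}, returned as a dict label -> probability."""
    if x == 0:
        return {1: 1.0, -1: 0.0}
    if x == epsilon:
        return {1: 0.0, -1: 1.0}
    raise ValueError("x lies outside the support of p")

epsilon = 0.3          # illustrative value
x, y = 0.0, 1          # clean sample
x_adv = x + epsilon    # perturbation of size exactly epsilon

fooled = threshold_classifier(x_adv, epsilon) != y
posterior = true_posterior(x_adv, epsilon)
print(fooled, posterior)
```

The classifier's prediction changes, but so does the label that the data distribution assigns to the perturbed point, which is why the norm constraint alone does not give label-invariance here.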

This suggests that a proper adversarial example has to change the predicted label, but must not change the label with respect to the data distribution. This example, however, has a small caveat: the data distribution is zero for any $\tilde{x} \notin \{0, \epsilon\}$ — and it is not guaranteed that standard attacks will produce adversarial examples with non-zero probability. A simple strategy to assign true labels to any $\tilde{x}$ is using an orthogonal projection onto the support of the data distribution; in our example, any adversarial example $\tilde{x}$ will get assigned a label based on its proximity to $0$ or $\epsilon$:

$\pi(\tilde{x}) = \begin{cases}\epsilon&\text{if } |\tilde{x} - \epsilon| < |\tilde{x} - 0|\\0&\text{otherwise}\end{cases}$

Then, the max-margin classifier with threshold $\frac{\epsilon}{2}$ will be accurate and robust when only allowing adversarial examples adhering to the following definition:

Definition. Let $p$ be a data distribution and $x$ be a sample with label $y$ such that $p(x, y) > 0$ and $f(x) = y$ for some neural network $f$. Then, $\tilde{x}$ is an adversarial example if $f(\tilde{x}) \neq y$ but $p(y|\pi(\tilde{x})) > p(y'|\pi(\tilde{x}))$ for any $y' \neq y$ with $\pi(\tilde{x})$ being the orthogonal projection of $\tilde{x}$ onto the support of $p$.
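For the point-mass example, the definition can be turned into a small check. The following sketch uses hypothetical helper names (`project`, `true_label`, `classify`, `is_adversarial`) and an illustrative $\epsilon = 0.5$. Ties at exactly $\frac{\epsilon}{2}$ go to $0$, following the projection $\pi$ above, and the classifier is assumed to break the tie the same way.

```python
def project(x_tilde, epsilon):
    """Orthogonal projection onto the support {0, epsilon}; ties go to 0."""
    return epsilon if abs(x_tilde - epsilon) < abs(x_tilde - 0.0) else 0.0

def true_label(x_tilde, epsilon):
    """Label assigned by the data distribution to the projected point."""
    return -1 if project(x_tilde, epsilon) == epsilon else 1

def classify(x_tilde, epsilon):
    """Max-margin classifier with threshold epsilon / 2."""
    return 1 if x_tilde <= epsilon / 2 else -1

def is_adversarial(x_tilde, y, epsilon):
    """True if x_tilde fools the classifier while its projected true label stays y."""
    return classify(x_tilde, epsilon) != y and true_label(x_tilde, epsilon) == y

epsilon = 0.5
# The perturbation from the example, x = 0 (y = 1) moved to epsilon, changes the
# true label and is therefore not an adversarial example under the definition.
naive = is_adversarial(0.0 + epsilon, 1, epsilon)

# Search a grid of candidate perturbed points for valid adversarial examples.
grid = [k / 8 for k in range(-8, 13)]
found = [(x_t, y) for x_t in grid for y in (1, -1) if is_adversarial(x_t, y, epsilon)]
print(naive, found)
```

On this grid no candidate satisfies the definition, consistent with the claim that the max-margin classifier is both accurate and robust once label-changing perturbations are excluded.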

Discussion of []

Figure 2: Illustration of the toy dataset proposed in [] for $p = 0.9$ and $\eta = 3$; see the text for details.

In the following, I want to discuss a particularly interesting toy example first considered in []. For labels $y \in \{-1,1\}$ with $p(y = 1) = p(y = -1) = 0.5$ let the observations $x \in \{-1,1\} \times \mathbb{R}$ be drawn as follows:

$p(x_1|y) = \begin{cases}p & \text{if }x_1 = y\\1 - p&\text{if }x_1 \neq y\end{cases}$

$p(x_2|y) = \mathcal{N}(x_2; y\eta, 1)$

where $p$ and $\eta$ are parameters of the dataset; specifically, $p$ defines how reliable $x_1$ is in predicting the label, and $\eta$ defines the overlap of both distributions over $x_2$. For illustration, Figure 2 shows the case of $p = 0.9$ and $\eta = 3$.
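A minimal sketch of how samples from this toy dataset could be drawn, using only the Python standard library. The function name `sample`, the seed and the sample count are illustrative assumptions, and $p = 0.9$, $\eta = 3$ match the setting of Figure 2.

```python
import random

def sample(p, eta, rng):
    """Draw one (x, y) pair: x1 agrees with y with probability p, x2 ~ N(y * eta, 1)."""
    y = rng.choice([-1, 1])
    x1 = y if rng.random() < p else -y
    x2 = rng.gauss(y * eta, 1.0)
    return (x1, x2), y

rng = random.Random(0)
data = [sample(0.9, 3.0, rng) for _ in range(10000)]

# Fraction of samples where the first feature alone predicts the label.
agreement = sum(x[0] == y for x, y in data) / len(data)
print(agreement)
```

The empirical agreement between $x_1$ and $y$ should be close to $p$, while $x_2$ separates the two classes more reliably for larger $\eta$.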

For an $L_\infty$-bounded adversary with $\epsilon \geq 2 \eta$, that is, an attacker that can manipulate the samples by $2\eta$ or more per dimension, Tsipras et al. [] show that no classifier can be both robust and accurate on this dataset. For example, let $y = 1$ but $x_1 = -1$ and $x_2 = \eta$. The adversary in [] replaces $x_2$ with $\tilde{x}_2 = x_2 - 2\eta = -\eta$. Since $x_1$ and $x_2$ are drawn independently given $y$ and both labels are equally likely, Bayes' rule gives, for $p \geq 0.5$ and $\eta > 0$:

$p(y = 1 | x = \tilde{x})$

$\propto p(x_1 = -1 | y = 1) \cdot p(x_2 = - \eta | y = 1)$

$= (1 - p) \cdot \mathcal{N}(x_2 = - \eta; \eta, 1)$

$< p \cdot \mathcal{N}(x_2 = - \eta; -\eta, 1)$

$= p(x_1 = -1 | y = -1) \cdot p(x_2 = - \eta | y = -1)$

$\propto p(y=-1 | x = \tilde{x})$

which contradicts our definition of adversarial examples: with respect to the data distribution, the perturbed sample $\tilde{x}$ is more likely to have label $-1$ than label $1$. However, I want to note that the adversarial example is valid when considered in the context of the adversarial loss as defined in []. Overall, this example illustrates that different intuitions and definitions of adversarial examples may lead to widely different results, and not only on such toy examples.
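The comparison of the two posteriors can also be evaluated numerically. The sketch below assumes $p = 0.9$ and $\eta = 3$ as in Figure 2, equal class priors, and conditional independence of $x_1$ and $x_2$ given $y$. The helper names `gaussian_pdf` and `posterior` are illustrative.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior(x1, x2, p, eta):
    """p(y | x1, x2) for y in {-1, 1} via Bayes' rule with equal priors."""
    joint = {}
    for y in (-1, 1):
        joint[y] = 0.5 * (p if x1 == y else 1 - p) * gaussian_pdf(x2, y * eta)
    total = sum(joint.values())
    return {y: v / total for y, v in joint.items()}

p, eta = 0.9, 3.0
before = posterior(-1, eta, p, eta)             # clean sample with y = 1
after = posterior(-1, eta - 2 * eta, p, eta)    # x2 shifted by -2 * eta
print(before, after)
```

For these parameters, the clean sample is assigned label $1$ with high probability, while the shifted sample is assigned label $-1$ with high probability, so the perturbation changes the label under the data distribution.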


In this article, I wanted to give a brief intuition of adversarial examples in the context of toy datasets where label-invariance cannot be ensured using norm-constraints. Instead, I argued that label-invariance needs to be enforced through the underlying data distribution. In particular, an adversarial example should fool the classifier, but must not change the label with respect to the data distribution. As adversarial examples are not guaranteed to have non-zero probability under the data distribution, I suggested considering the orthogonal projection onto the support of the data distribution. On toy datasets, I demonstrated that this definition of adversarial examples makes a significant difference.


  • [] A. Athalye and N. Carlini. On the robustness of the CVPR 2018 white-box adversarial example defenses. arXiv.org, abs/1804.03286, 2018.
  • [] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv.org, abs/1802.00420, 2018.
  • [] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv.org, abs/1702.05373, 2017.
  • [] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv.org, abs/1708.07747, 2017.
  • [] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial spheres. ICLR Workshops, 2018.
  • [] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arXiv.org, abs/1805.12152, 2018.
  • [] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is robustness the cost of accuracy? – a comprehensive study on the robustness of 18 deep image classification models. arXiv.org, abs/1808.01688, 2018.
  • [] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.