Goodfellow et al. introduce the fast gradient sign method (FGSM) to craft adversarial examples and further provide a possible interpretation of adversarial examples based on linear models. FGSM is a gradient-based, one-step method for generating adversarial examples. In particular, letting $J$ be the objective optimized during training and $\epsilon$ be the maximum $L_\infty$-norm of the adversarial perturbation, FGSM computes
$x' = x + \eta = x + \epsilon \text{sign}(\nabla_x J(x, y))$
where $y$ is the label for sample $x$. The $\text{sign}$ function is applied element-wise here. The method is demonstrated on several examples and is commonly used in related work.
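To make the update rule concrete, here is a minimal sketch of a single FGSM step in plain Python. It is not the setup used in the paper: as an assumption for illustration, the model is a logistic regression with labels $y \in \{-1, +1\}$ and loss $J(x, y) = \log(1 + \exp(-y (w^T x + b)))$, so that the gradient with respect to $x$ can be written by hand. The names `logistic_loss`, `loss_grad_x` and `fgsm`, as well as the weights and the input, are hypothetical.

```python
import math


def sign(v):
    # Sign of a scalar: -1, 0 or +1.
    return (v > 0) - (v < 0)


def logistic_loss(w, b, x, y):
    # J(x, y) = log(1 + exp(-y * (w.x + b))) for a label y in {-1, +1}.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.log1p(math.exp(-y * z))


def loss_grad_x(w, b, x, y):
    # Gradient of J with respect to the input x (not the weights):
    # dJ/dx = -y / (1 + exp(y * z)) * w
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    s = 1.0 / (1.0 + math.exp(y * z))
    return [-y * s * wi for wi in w]


def fgsm(x, grad, eps):
    # x' = x + eps * sign(grad), with sign applied element-wise.
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


# Made-up toy model and input, for illustration only.
w = [1.0, -2.0, 0.5]
b = 0.0
x = [0.5, -0.25, 1.0]
y = 1
eps = 0.1

x_adv = fgsm(x, loss_grad_x(w, b, x, y), eps)
print("loss before:", logistic_loss(w, b, x, y))
print("loss after: ", logistic_loss(w, b, x_adv, y))
```

Each entry of `x_adv` differs from `x` by exactly `eps`, so the perturbation has $L_\infty$-norm $\epsilon$, and the loss on the perturbed input is larger than on the original one. For a deep network, the hand-written gradient would be replaced by backpropagation through the model.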
In the remainder of the paper, Goodfellow et al. discuss a linear interpretation of why adversarial examples exist. Specifically, considering the dot product
$w^T x' = w^T x + w^T \eta$
it becomes apparent that the perturbation $\eta$ – although insignificant on a per-pixel level (i.e., each entry is at most $\epsilon$ in absolute value) – can influence the activation of a single neuron significantly. What is more, this effect is more pronounced the higher the dimensionality of $x$, because the per-dimension contributions $w_i \eta_i$ add up. Additionally, many network architectures today use $\text{ReLU}$ activations, which are piecewise linear and thus behave largely linearly.
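The dependence on dimensionality can be checked numerically. Choosing $\eta = \epsilon \, \text{sign}(w)$ gives $w^T \eta = \epsilon \sum_i |w_i|$, which grows with the number of dimensions while every entry of $\eta$ stays bounded by $\epsilon$. The sketch below assumes, purely for illustration, standard normal weights and $\epsilon = 0.01$; the helper `activation_shift` is a hypothetical name.

```python
import random


def sign(v):
    # Sign of a scalar: -1, 0 or +1.
    return (v > 0) - (v < 0)


def activation_shift(w, eps):
    # Change in the activation w.x caused by eta = eps * sign(w),
    # i.e. w.eta = eps * sum(|w_i|).
    eta = [eps * sign(wi) for wi in w]
    return sum(wi * ei for wi, ei in zip(w, eta))


random.seed(0)
# Hypothetical weight vector; prefixes of it stand in for lower dimensions.
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
eps = 0.01

shifts = {n: activation_shift(weights[:n], eps) for n in (10, 100, 1000)}
for n, shift in shifts.items():
    print(f"n = {n:4d}: shift in w.x = {shift:.4f}")
```

The per-entry perturbation is the same in all three cases, yet the shift in the activation grows roughly linearly with the dimension.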
Goodfellow et al. conduct several more experiments; I want to highlight the conclusions of some of them:
Training on adversarial samples can be seen as regularization. Based on experiments, it is more effective than $L_1$ regularization or adding random noise.
The direction of the perturbation matters most. Adversarial samples might be transferable because similar models learn similar functions, so the same perturbation directions are similarly effective across them.
Ensembles are not necessarily resistant to perturbations.