26^{th}JUNE2018

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, Patrick D. McDaniel. *Ensemble Adversarial Training: Attacks and Defenses*. CoRR abs/1705.07204, 2017.

Also find this summary on ShortScience.org.

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below:

Tramèr et al. introduce both a novel adversarial attack as well as a defense mechanism against black-box attacks termed ensemble adversarial training. I first want to highlight that – in addition to the proposed methods – the paper gives a very good discussion of state-of-the-art attacks as well as defenses and how to put them into context. Tramèr et al. consider black-box attacks, focussing on transferrable adversarial examples. Their main observation is as follows: one-shot attacks (i.e. one evaluation of the model's gradient) on adversarially trained models are likely to overfit to the model's training loss. This observation has two aspects that are experimentally validated in the paper. First, the loss of the adversarially trained model increases sharply when considering adversarial examples crafted on a different model; second, the network learns to fool the attacker by, locally, misleading the gradient – this means that perturbations computed on adversarially trained models are specialized to the local loss. These observations are also illustrated in Figure 1, however, I refer to the paper for a detailed discussion.

Figure 1: Illustration of the discussed observations. On the left, the loss function of an adversarially trained model considering a sample $x = x + \epsilon_1 x' + \epsilon_2 x''$ where $x'$ is a perturbation computed on the adversarially trained model and $x''$ is a perturbation computed on a different model. On the right, zoomed in version where it can be seen that the loss rises sharply in the direction of $\epsilon_1$; i.e. the model gives misleading gradients.

Based on the above observations, Tramèr et al. First introduce a new one-shot attack exploiting the fact that the adversarially trained model is trained on overfitted perturbations and second introduce a new counter-measure for training more robust networks. Their attack is quite simple; they consider one Fast-Gradient Sign Method (FSGM) step, but apply a random perturbation first to leave the local vicinity of the sample first:

$x' = x + \alpha \text{sign}(\mathcal{N}(0, I))$

$x'' = x' + (\epsilon - \alpha)\text{sign}(\nabla_{x'} J(x', y))$

where $J$ is the loss function and $y$ the label corresponding to sample $x$. In experiments, they show that the attack has higher success rates on adversarially trained models.

To counter the proposed attack, they propose ensemble adversarial training. The key idea is to train the model utilizing not only adversarial samples crafted on the model itself but also transferred from pre-trained models. On MNIST, for example, they randomly select 64 FGSM samples from 4 different models (including the one in training). Experimentally, they show that ensemble adversarial training improves the defense again all considered attacks, including FGSM, iterative FGSM as well as the proposed attack.