Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, Aleksander Madry. There Is No Free Lunch In Adversarial Robustness (But There Are Unexpected Benefits). CoRR abs/1805.12152 (2018).

Tsipras et al. investigate the trade-off between classification accuracy and adversarial robustness. In particular, on a very simple toy dataset, they proof that such a trade-off exists; this means that very accurate models will also have low robustness. Overall, on this dataset, they find that there exists a sweet-spot where the accuracy is 70% and the adversarial accuracy (i.e., accuracy on adversarial examples) is 70%. Using adversarial training to obtain robust networks, they additionally show that the robustness is increased by not using “fragile” features – features that are only weakly correlated with the actual classification tasks. Only focusing on few, but “robust” features also has the advantage of more interpretable gradients and sparser weights (or convolutional kernels). Due to the induced robustness, adversarial examples are perceptually significantly more different from the original examples, as illustrated in Figure 1 on MNIST.

Figure 1: Illustration of adversarial examples for a standard model, a model trained using $L_\infty$ adversarial training and $L_2$ adversarial training. Especially for the $L_2$ case it is visible that adversarial examples need to change important class characteristics to fool the network.

Also find this summary on ShortScience.org.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.