Tsipras et al. investigate the trade-off between classification accuracy and adversarial robustness. In particular, on a very simple toy dataset, they proof that such a trade-off exists; this means that very accurate models will also have low robustness. Overall, on this dataset, they find that there exists a sweet-spot where the accuracy is 70% and the adversarial accuracy (i.e., accuracy on adversarial examples) is 70%. Using adversarial training to obtain robust networks, they additionally show that the robustness is increased by not using “fragile” features – features that are only weakly correlated with the actual classification tasks. Only focusing on few, but “robust” features also has the advantage of more interpretable gradients and sparser weights (or convolutional kernels). Due to the induced robustness, adversarial examples are perceptually significantly more different from the original examples, as illustrated in Figure 1 on MNIST.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: