November 26, 2019. The paper has been updated with a more thorough description of the experimental setup and evaluation metrics, and with additional experiments on $L_0$ and $L_1$ attacks.
Abstract
Figure 1: The effect of confidence calibration on adversarial training. Left: confidence per class along an adversarial direction for adversarial training (AT) and the proposed confidence-calibrated adversarial training (CCAT). Right: confidence histogram for test and adversarial examples.
Adversarial training is the standard approach for training models that are robust against adversarial examples. However, especially for complex datasets, adversarial training incurs a significant loss in accuracy and is known to generalize poorly to stronger attacks, e.g., larger perturbations or other threat models. In this paper, we introduce confidence-calibrated adversarial training (CCAT), where the key idea is to enforce that the confidence on adversarial examples decays with their distance to the attacked examples. We show that CCAT better preserves the accuracy of normal training, while robustness against adversarial examples is achieved via confidence thresholding. Most importantly, in strong contrast to adversarial training, the robustness of CCAT generalizes to larger perturbations and other threat models not encountered during training. We also discuss our extensive work to design strong adaptive attacks against CCAT and standard adversarial training, which is of independent interest. We present experimental results on MNIST, SVHN and Cifar10.
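To make the two ideas in the abstract concrete (confidence that decays with distance, and rejection by confidence thresholding), here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the training code from the paper: the function names `ccat_target` and `accept`, the $L_\infty$ distance, the power-law decay with exponent `rho`, and the default values of `epsilon`, `rho` and `tau` are all choices made for this example and are not specified in the text above.

```python
import numpy as np


def ccat_target(label, delta, num_classes, epsilon=0.03, rho=10.0):
    """Soft target that moves from one-hot to uniform as the perturbation grows.

    Assumptions for this sketch: distance is measured in the L_inf norm,
    and the decay follows (1 - distance / epsilon) ** rho.
    """
    one_hot = np.zeros(num_classes)
    one_hot[label] = 1.0
    uniform = np.full(num_classes, 1.0 / num_classes)

    # Fraction of the perturbation budget used, clipped to [0, 1].
    used = min(1.0, float(np.max(np.abs(delta))) / epsilon)

    # lam = 1 for the clean example, lam = 0 at (or beyond) the budget.
    lam = (1.0 - used) ** rho
    return lam * one_hot + (1.0 - lam) * uniform


def accept(probabilities, tau=0.5):
    """Confidence thresholding: keep a prediction only if it is confident enough."""
    return float(np.max(probabilities)) >= tau


if __name__ == "__main__":
    clean = ccat_target(label=2, delta=np.zeros(4), num_classes=4)
    far = ccat_target(label=2, delta=np.full(4, 0.06), num_classes=4)
    print(clean, accept(clean))
    print(far, accept(far))
```

In this sketch, a model trained to match such targets would output near-uniform predictions on strongly perturbed inputs, so a fixed threshold `tau` on the maximum predicted probability rejects them while confident predictions on clean inputs pass.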
Paper on ArXiv
@article{Stutz2019ARXIV,
  author = {David Stutz and Matthias Hein and Bernt Schiele},
  title = {Confidence-Calibrated Adversarial Training: Towards Robust Models Generalizing Beyond the Attack Used During Training},
  journal = {CoRR},
  volume = {abs/1910.06259},
  year = {2019}
}