# DAVIDSTUTZ

## Updated ArXiv Pre-Print “Confidence-Calibrated Adversarial Training”

Adversarial training yields robust models against a specific threat model. However, robustness does not generalize to larger perturbations or to threat models not seen during training. Confidence-calibrated adversarial training tackles this problem by biasing the network towards low-confidence predictions on adversarial examples. By rejecting low-confidence (adversarial) examples, robustness generalizes to various threat models, including $L_2$, $L_1$ and $L_0$, while training only on $L_\infty$ adversarial examples. This article gives a short abstract, discusses relevant updates to the previous version and includes the paper and appendix.

This paper was accepted at ICML'20!

This is a substantial update to the previous version, including additional experiments with $L_2$, $L_1$ and $L_0$ adversarial examples as well as adversarial frames.

### Abstract

Adversarial training yields robust models against a specific threat model, e.g., $L_\infty$ adversarial examples. Typically robustness does not generalize to previously unseen threat models, e.g., other $L_p$ norms, or larger perturbations. Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low confidence predictions on adversarial examples. By allowing to reject examples with low confidence, robustness generalizes beyond the threat model employed during training. CCAT, trained only on $L_\infty$ adversarial examples, increases robustness against larger $L_\infty$, $L_2$, $L_1$ and $L_0$ attacks, adversarial frames, distal adversarial examples and corrupted examples and yields better clean accuracy compared to adversarial training. For thorough evaluation we developed novel white- and black-box attacks directly attacking CCAT by maximizing confidence. For each threat model, we use $7$ attacks with up to $50$ restarts and $5000$ iterations and report worst-case robust test error, extended to our confidence-thresholded setting, across all attacks.

@article{Stutz2019ARXIV,
author    = {David Stutz and Matthias Hein and Bernt Schiele},
title     = {Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks},
journal   = {CoRR},
volume    = {abs/1910.06259},
year      = {2019}
}
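
The two ideas in the abstract, low-confidence targets on adversarial examples and rejection by confidence, can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the released implementation: the function names, the power-law transition with exponent `rho`, and the convention of choosing the threshold so that about 99% of clean examples are kept are all assumptions made here for illustration, and `thresholded_error` is a simplified stand-in for the confidence-thresholded robust test error used in the paper.

```python
import numpy as np


def soft_target(label, num_classes, delta, epsilon, rho=10.0):
    """Hypothetical training target: one-hot for no perturbation,
    uniform once the L_inf norm of delta reaches epsilon."""
    linf = np.max(np.abs(delta))
    # Assumed power-law transition; lam = 1 keeps the one-hot label.
    lam = (1.0 - min(1.0, linf / epsilon)) ** rho
    one_hot = np.eye(num_classes)[label]
    uniform = np.full(num_classes, 1.0 / num_classes)
    return lam * one_hot + (1.0 - lam) * uniform


def confidence_threshold(clean_probs, keep_fraction=0.99):
    """Pick a threshold so that roughly `keep_fraction` of the clean
    examples have confidence at or above it (assumed convention)."""
    confidences = clean_probs.max(axis=1)
    return float(np.quantile(confidences, 1.0 - keep_fraction))


def thresholded_error(probs, labels, tau):
    """Error among examples that are not rejected (confidence >= tau)."""
    confidences = probs.max(axis=1)
    kept = confidences >= tau
    if not kept.any():
        return 0.0  # everything was rejected, so no errors remain
    wrong = probs.argmax(axis=1) != labels
    return float(np.mean(wrong[kept]))
```

In this sketch, an adversarial example that drives the prediction towards the uniform distribution falls below the threshold and is rejected, so it no longer counts as an error, whatever threat model produced it.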


Besides significantly improved writing, the paper includes the following improvements:

• Results with additional $L_2$, $L_1$ and $L_0$ attacks, including the Square attack [1], the Corner Search attack [2] and the Geometry attack [3].
• Results with adversarial frames [4].
• Comparison to multi-steepest descent (MSD) adversarial training [5], i.e., training with $L_\infty$, $L_2$ and $L_1$ adversarial examples.
• Comparison to the Mahalanobis and Local Intrinsic Dimensionality detectors [6, 7], which were "cracked" using our thorough evaluation.

The code, including training procedures and all attacks, to reproduce the results in the paper will be provided on GitHub soon.

• [1] Andriushchenko, M., Croce, F., Flammarion, N., and Hein, M. Square attack: a query-efficient black-box adversarial attack via random search. arXiv.org, abs/1912.00049, 2019.
• [2] Croce, F. and Hein, M. Sparse and imperceivable adversarial attacks. arXiv.org, abs/1909.05040, 2019.
• [3] Khoury, M. and Hadfield-Menell, D. On the geometry of adversarial examples. arXiv.org, abs/1811.00525, 2018.
• [4] Zajac, M., Zolna, K., Rostamzadeh, N., and Pinheiro, P. O. Adversarial framing for image and video classification. In AAAI Workshops, 2019.
• [5] Maini, P., Wong, E., and Kolter, J. Z. Adversarial robustness against the union of multiple perturbation models. arXiv.org, abs/1909.04068, 2019.
• [6] Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.
• [7] Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S. N. R., Schoenebeck, G., Song, D., Houle, M. E., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. In ICLR, 2018.