# DAVIDSTUTZ

## Talk on Confidence-Calibrated Adversarial Training at BCAI and Tübingen AI Center

Recently, I had the opportunity to present my work on confidence-calibrated adversarial training at the Bosch Center for Artificial Intelligence and at the University of Tübingen, specifically at the newly formed Tübingen AI Center. In the talk, I outlined the motivation for confidence-calibrated adversarial training and its strengths compared to standard adversarial training: robustness against previously unseen attacks and improved accuracy. I also touched on the difficulties faced during robustness evaluation. This article provides the corresponding slides and gives a short overview of the talk.

### Abstract

Adversarial training (AT), i.e., training on adversarial examples generated on-the-fly, is the standard approach for obtaining robust models within a specific threat model, e.g., $L_\infty$ adversarial examples. However, this robustness does not generalize to previously unseen attacks such as larger perturbations or other $L_p$ threat models. Furthermore, adversarial training often incurs a drop in accuracy. Confidence-calibrated adversarial training (CCAT) tackles these problems by biasing the network towards low-confidence predictions on adversarial examples. Trained only on $L_\infty$ adversarial examples, CCAT improves robustness against unseen attacks, including $L_2$, $L_1$ and $L_0$ adversarial examples as well as adversarial frames, by rejecting low-confidence (adversarial) examples. Additionally, accuracy is improved compared to AT.
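The rejection step mentioned in the abstract can be illustrated with a small sketch. It is not the implementation behind the talk: the function name `reject_low_confidence` and the fixed `threshold` argument are assumptions made for illustration, and how the threshold is chosen is left open here.

```python
import numpy as np


def reject_low_confidence(probabilities, threshold):
    """Split predictions into accepted and rejected ones by confidence.

    probabilities: array of shape (num_examples, num_classes) with
        predicted class probabilities (rows sum to one).
    threshold: hypothetical confidence threshold; examples whose
        maximum probability falls below it are rejected.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    # Confidence is the highest predicted class probability per example.
    confidences = probabilities.max(axis=1)
    predictions = probabilities.argmax(axis=1)
    # Boolean mask: True where the prediction is kept, False where rejected.
    accepted = confidences >= threshold
    return predictions, accepted


# Toy example: one confident and one near-uniform prediction.
example_probabilities = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.30, 0.30],
])
example_predictions, example_accepted = reject_low_confidence(
    example_probabilities, threshold=0.5
)
```

In this toy example, the first prediction is kept and the second, near-uniform one is rejected. A model trained to be uncertain on adversarial examples would ideally push those examples into the rejected group.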

This talk motivates CCAT by the observation that adversarial examples usually leave the underlying manifold of the data, see Figure 2. By encouraging low-confidence predictions on adversarial examples, i.e., off-manifold, the model is biased to extrapolate this behavior to arbitrary regions. The hypothesis is that the robustness of standard AT does not generalize well to unseen attacks because high-confidence predictions cannot be extrapolated to arbitrary regions in a meaningful way. Furthermore, high-confidence predictions are problematic if the $\epsilon$-balls used during training overlap, e.g., for training examples from different classes. Both problems are addressed by predicting uniform confidence within the largest parts of the $\epsilon$-balls, as encouraged by CCAT. Both cases are illustrated in Figure 1.
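One way to picture "uniform confidence within the largest parts of the $\epsilon$-balls" is a training target that moves from the true label towards the uniform distribution as the perturbation grows. The following is a minimal NumPy sketch under stated assumptions, not the code used for the talk: the function name `ccat_target`, the power-law transition and the default value of `rho` are all chosen here for illustration.

```python
import numpy as np


def ccat_target(label, delta, epsilon, num_classes, rho=10.0):
    """Interpolate between a one-hot label and the uniform distribution.

    label: integer class index of the clean example.
    delta: perturbation added to the clean example (any shape).
    epsilon: radius of the L_inf ball used during training.
    rho: assumed exponent controlling how quickly the target
        becomes uniform as the perturbation grows.
    """
    one_hot = np.zeros(num_classes)
    one_hot[label] = 1.0
    uniform = np.full(num_classes, 1.0 / num_classes)
    # L_inf norm of the perturbation, relative to the ball radius.
    relative_norm = min(1.0, float(np.max(np.abs(delta))) / epsilon)
    # Weight is 1 for no perturbation and 0 on the boundary of the ball.
    weight = (1.0 - relative_norm) ** rho
    return weight * one_hot + (1.0 - weight) * uniform


# No perturbation keeps the one-hot label; a perturbation on the
# boundary of the epsilon-ball yields the uniform distribution.
clean_target = ccat_target(1, np.zeros(4), epsilon=0.03, num_classes=3)
boundary_target = ccat_target(1, np.full(4, 0.03), epsilon=0.03, num_classes=3)
```

With a large exponent, the weight on the true label decays quickly, so the target is close to uniform in most of the ball and only near the clean example does it stay close to one-hot. Training against such targets, e.g., with a cross-entropy loss, is what would bias the model towards low confidence on perturbed inputs.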

Finally, the talk also includes three important lessons for properly evaluating CCAT:

1. Use proper evaluation metrics that allow rejecting (adversarial) examples.