ICML Talk “Confidence-Calibrated Adversarial Training”

Confidence-calibrated adversarial training (CCAT) addresses two problems of training on adversarial examples: robustness that does not carry over to adversarial examples unseen during training, and reduced accuracy on clean examples. In particular, CCAT biases the model towards low-confidence predictions on adversarial examples so that these examples can be rejected by confidence thresholding. In this article, I share the slides of the corresponding ICML talk.
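
Rejection by confidence thresholding can be sketched in a few lines of Python. The sketch computes the softmax over the logits, takes the maximum probability as the confidence, and rejects the input if the confidence falls below a threshold τ. The function names and the threshold `tau=0.9` are hypothetical choices for illustration. In practice, τ would be chosen on held-out clean data, for example so that nearly all clean examples are still accepted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_reject(logits, tau):
    """Return (label, confidence); label is None if confidence < tau (rejected)."""
    probs = softmax(logits)
    confidence = max(probs)
    label = probs.index(confidence)
    if confidence < tau:
        return None, confidence  # low confidence: treat as (possibly) adversarial
    return label, confidence

# A confident prediction is accepted.
print(predict_or_reject([4.0, 0.5, 0.1], tau=0.9))
# A near-uniform prediction, as CCAT encourages on adversarial inputs, is rejected.
print(predict_or_reject([0.2, 0.1, 0.15], tau=0.9))
```

The same threshold applies to clean and adversarial inputs alike; the model itself, not a separate detector, is what makes adversarial inputs fall below it.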


Check out the paper on arXiv.

Adversarial training (AT), i.e., training on adversarial examples generated on the fly, is the standard approach for obtaining robust models within a specific threat model, e.g., $L_\infty$ adversarial examples. However, it has been shown that the resulting robustness does not generalize to attacks not seen during training, such as larger $L_\infty$ perturbations or other $L_p$ threat models. Additionally, AT is known to reduce accuracy on clean examples. In this talk, I introduce confidence-calibrated adversarial training (CCAT), which tackles both problems by biasing the network towards low-confidence predictions on adversarial examples. Because these low-confidence predictions extrapolate beyond the $L_\infty$ adversarial examples seen during training, robustness generalizes to previously unseen attacks: low-confidence (adversarial) examples are rejected. Trained only on $L_\infty$ adversarial examples, CCAT improves robustness against $L_2$, $L_1$ and $L_0$ adversarial examples as well as adversarial frames. Furthermore, accuracy improves compared to AT.
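
To make "biasing the network towards low-confidence predictions" concrete, here is a minimal Python sketch of one possible training target. It assumes the following formulation, which should be checked against the paper for the authoritative definition: for an adversarial example $x + \delta$ with label $y$ among $K$ classes, the target is $\tilde{y} = \lambda(\delta)\, e_y + (1 - \lambda(\delta))\, \frac{1}{K}\mathbf{1}$, where $e_y$ is the one-hot vector of $y$, with the power transition $\lambda(\delta) = (1 - \min(1, \|\delta\|_\infty/\epsilon))^\rho$. The helper names (`linf_norm`, `transition`, `ccat_target`, `cross_entropy`) and the values `rho=10` and `epsilon=0.03` are hypothetical choices for illustration.

```python
import math

def linf_norm(delta):
    """L_inf norm of a perturbation given as a flat list of floats."""
    return max(abs(d) for d in delta)

def transition(delta_norm, epsilon, rho=10.0):
    """Assumed power transition (illustrative): weight on the one-hot label,
    1 for delta = 0 and decaying to 0 once ||delta||_inf >= epsilon."""
    return (1.0 - min(1.0, delta_norm / epsilon)) ** rho

def ccat_target(label, num_classes, delta_norm, epsilon, rho=10.0):
    """Convex combination of the one-hot label and the uniform distribution."""
    lam = transition(delta_norm, epsilon, rho)
    uniform = 1.0 / num_classes
    return [lam * (1.0 if k == label else 0.0) + (1.0 - lam) * uniform
            for k in range(num_classes)]

def cross_entropy(target, probs):
    """Cross-entropy between a (soft) target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

epsilon = 0.03  # hypothetical L_inf budget used during training
for delta in ([0.0, 0.0], [0.005, -0.01], [0.03, 0.0]):
    target = ccat_target(label=0, num_classes=10,
                         delta_norm=linf_norm(delta), epsilon=epsilon)
    print(linf_norm(delta), round(target[0], 4))
```

With a large ρ, the target is already close to uniform for perturbations well inside the ε-ball (at $\|\delta\|_\infty = \epsilon/3$, $\lambda \approx 0.017$), so minimizing cross-entropy against these targets pushes the model towards low confidence on adversarial examples, which a confidence threshold can then reject.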

Figure 1: Illustration of how confidence-calibrated adversarial training (CCAT, right panel) tackles the poor generalization of the robustness obtained by standard adversarial training (AT, left panel).


The slides of the talk can be found below:

