Proper Robustness Evaluation of Confidence-Calibrated Adversarial Training in PyTorch

Properly evaluating defenses against adversarial examples has been difficult as adversarial attacks need to be adapted to each individual defense. This also holds for confidence-calibrated adversarial training, where robustness is obtained by rejecting adversarial examples based on their confidence. Thus, regular robustness metrics and attacks are not easily applicable. In this article, I want to discuss how to evaluate confidence-calibrated adversarial training in terms of metrics and attacks.


Adaptive adversarial attacks have become a key ingredient in proper robustness evaluation. These attacks take into account the different strategies that adversarial defenses, including various types of adversarial training, employ to improve adversarial robustness. Essentially, they grant the attacker knowledge about the specific strategy so that the attack can be adapted accordingly. However, for many adversarial training-based methods, projected gradient descent (PGD) based attacks with enough restarts and iterations, potentially coupled with momentum, seem to work reasonably well.

However, regarding confidence-calibrated adversarial training, as discussed in this previous article, standard PGD does not necessarily work. While it obtains a high robust test error (that is, successfully changes the classification), most of the computed adversarial examples receive low confidence. Thus, in this article, I want to address two problems: first, how to design an adaptive attack that explicitly maximizes confidence to circumvent the confidence threshold; second, how to integrate a confidence-based rejection scheme into evaluation. The latter leads to a confidence-thresholded robust test error.

This article is part of a series of articles:

In particular, this article builds on the PGD implementation presented in previous articles.

The code for this article can be found on GitHub:

Code on GitHub

Adaptive Attacks

An adaptive attack for confidence-calibrated adversarial training takes into account that low-confidence adversarial examples will be rejected. That is, the best strategy for the attacker is to find adversarial examples that cause mis-classification and receive high confidence, as this increases the likelihood that the adversarial example is not rejected. In theory, the attacker could even know the exact confidence threshold used by confidence-calibrated adversarial training and stop the computation as soon as the adversarial example receives high-enough confidence. However, I will address how to compute the confidence threshold later.

Luckily, the projected gradient descent (PGD) attack employed during confidence-calibrated adversarial training, which maximizes the confidence in any other label, is already an appropriate adaptive attack. After training, however, significantly more iterations and restarts can be used. Additionally, backtracking and momentum help to account for the fact that confidence transitions from high to low values within the $\epsilon$-ball, resulting in a more complex optimization problem compared to standard adversarial training.

PyTorch code for PGD with backtracking and momentum, maximizing confidence, can be found in this previous article.
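To make the attack objective concrete, below is a minimal sketch of the confidence-maximizing loss such an attack can optimize. The helper name `max_confidence_loss` is hypothetical; the full PGD loop with momentum and backtracking follows the referenced article.

```python
import torch
import torch.nn.functional as F

def max_confidence_loss(logits, targets):
    # Log-probabilities over all classes:
    log_probs = F.log_softmax(logits, dim=1)
    # Mask out the true label so the maximum runs over the other labels only:
    masked = log_probs.clone()
    masked[torch.arange(logits.shape[0]), targets] = -float('inf')
    # Minimizing the negative maximum log-probability maximizes the
    # confidence in any label except the true one:
    return -masked.max(dim=1)[0].mean()
```

Taking a gradient descent step on this loss with respect to the input (and projecting back onto the $\epsilon$-ball) yields the adaptive PGD variant described above.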

Confidence-Thresholded Robust Test Error

After computing adversarial examples, robust test error quantifies the model's robustness by counting the number of examples that can be attacked successfully. Formally, the robust test error can be defined as

$\frac{1}{N} \sum_{n = 1}^N \max_{\|\delta\|_\infty \leq \epsilon} \mathbb{1}_{f(x_n + \delta) \neq y_n}$(1)

where $N$ is the number of test examples $(x_n, y_n)$ and $f$ the model. Note that this also takes into account mis-classified examples, which can be thought of as "trivial" adversarial examples.
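As a quick sanity check, Equation (1) can be computed directly from per-example error indicators; the arrays below are made-up values for illustration.

```python
import numpy as np

# Hypothetical 0-1 indicators for four test examples:
clean_errors = np.array([0, 0, 1, 0])        # the third example is already mis-classified
adversarial_errors = np.array([1, 0, 1, 0])  # worst-case errors after the attack

# An example counts towards the robust test error if it is mis-classified
# either without or with the adversarial perturbation:
robust_test_error = np.mean(np.logical_or(clean_errors, adversarial_errors))
print(robust_test_error)  # 0.5
```

Note that the mis-classified clean example is counted even though the attack also succeeds on it, matching the "trivial" adversarial example interpretation above.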

As the standard robust test error does not take into account the confidence of adversarial examples, it will usually be very high for confidence-calibrated adversarial training. This is because the model is not forced to predict the right class for large perturbations ($\|\delta\|_\infty = \epsilon$) but to predict uniform probabilities. This can be handled using a confidence threshold $\tau$ and only considering examples $x_n$ whose confidence $c(x_n) \geq\tau$ exceed this threshold:

$\frac{\sum_{n = 1}^N \max_{\|\delta\|_\infty \leq \epsilon,\, c(x_n + \delta) \geq \tau} \mathbb{1}_{f(x_n + \delta) \neq y_n}}{\sum_{n = 1}^N \max_{\|\delta\|_\infty \leq \epsilon} \mathbb{1}_{c(x_n + \delta) \geq \tau}}$(2)

Essentially, this equation makes sure that we only consider examples $x_n$ which themselves receive high confidence $c(x_n) \geq \tau$ or where we found an adversarial example with high confidence $c(x_n + \delta) \geq \tau$. This corresponds to the denominator. Then, among these examples, we count those where an adversarial example with high confidence can be found that flips the true label: $c(x_n + \delta) \geq \tau$ and $f(x_n + \delta) \neq y_n$. Note that, for $\tau = 0$, the denominator contains all examples such that we just compute the "regular", unthresholded robust test error from above. Thus, this confidence-thresholded version is fully comparable to existing results in the literature.

As highlighted in [], this formulation also takes care of several special cases where, for example, a high-confidence adversarial example is found for a low-confidence example. This makes implementing the confidence-thresholded robust test error non-trivial as I will show in the next section.

Selecting a Threshold

It remains unclear how to optimally choose the threshold $\tau$. The key objective is to select a threshold that reduces the (confidence-thresholded) robust test error as much as possible, while not rejecting too many correctly classified clean examples. So, if we define high-confidence adversarial examples as negatives (in a detection setting) and correctly classified clean examples as positives, we want to control the true positive rate (TPR) while reducing robust test error.

Additionally, we would like to choose the threshold $\tau$ without relying on adversarial examples. This is because it avoids overfitting the threshold to a specific type of attack or choosing it against too weak attacks. This leaves only clean examples to calibrate $\tau$ on.

Given these requirements, it seems natural just to fix the TPR. For example, we can calibrate $\tau$ in order to obtain a 99% TPR on a held-out validation set.

PyTorch Implementation

In order to appropriately implement the confidence-thresholded robust test error, we need the two main components outlined above: implementing the calibration procedure for $\tau$ and implementing Equation (2) using a given $\tau$. Note that a PGD attack maximizing confidence has been discussed in this previous article.

As outlined above, calibration just requires access to the confidences on a held-out validation set of clean examples:

Listing 1: Calibrating the threshold $\tau$ to obtain a specified TPR.

def confidence_at_tpr(self, tpr):
    # 1. Compute confidences of correctly classified clean examples on the validation set:
    # self.validation_confidences are the confidences for the validation examples
    # (i.e., the maximal predicted probability) and self.validation_errors is a 0-1
    # array indicating which examples have been classified incorrectly.
    self.sorted_validation_confidences = numpy.sort(numpy.copy(self.validation_confidences[numpy.logical_not(self.validation_errors)]))
    # 2. Determine the example where the corresponding confidence results in the target TPR:
    cutoff = math.floor(self.sorted_validation_confidences.shape[0] * round(1 - tpr, 2))
    return self.sorted_validation_confidences[cutoff]
  1. First, the confidences on correctly classified clean examples are sorted. Essentially, we consider each of the confidences as a potential value for $\tau$.
  2. Then, we determine the actual threshold by taking the `1 - TPR` quantile of these values. This ensures that we obtain the target TPR on the validation set by construction, and this should generalize to the test set as long as examples come from the same data distribution.
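Stripped of the class context, the calibration boils down to a quantile computation; the confidences below are made-up values for illustration.

```python
import math
import numpy as np

# Hypothetical confidences of correctly classified clean validation examples:
confidences = np.array([0.99, 0.95, 0.6, 0.8, 0.97, 0.92, 0.99, 0.85, 0.9, 0.98])
tpr = 0.9  # target true positive rate

sorted_confidences = np.sort(confidences)
# Take the (1 - TPR) quantile; rounding guards against floating point issues:
cutoff = math.floor(sorted_confidences.shape[0] * round(1 - tpr, 2))
tau = sorted_confidences[cutoff]
print(tau)  # 0.8: exactly 90% of the confidences are >= 0.8
```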

Next, the confidence-thresholded robust test error is computed by simply implementing Equation (2), considering numerator and denominator separately:

Listing 2: Compute confidence-thresholded robust test error.

def robust_test_error_at_confidence(self, threshold):
    # 1. Compute the numerator:
    # self.reference_errors and self.reference_confidences are the clean error indicators
    # and corresponding confidences; these are used to count mis-classified test examples.
    # self.test_adversarial_errors and self.test_adversarial_confidences are the
    # corresponding (worst-case) adversarial example errors and confidences.
    # Note that the numerator is essentially obtained by computing clean test errors
    # and robust test errors separately.
    numerator = (numpy.sum(self.reference_errors[self.reference_confidences >= threshold].astype(int))
                + numpy.sum(self.test_adversarial_errors[numpy.logical_and(self.test_adversarial_confidences >= threshold, numpy.logical_not(self.reference_errors))].astype(int)))
    # 2. Compute the denominator:
    # Again, we first compute the number of clean examples that exceed the threshold
    # and add the number of examples where this is not the case but
    # the corresponding adversarial example exceeds the threshold.
    denominator = (numpy.sum((self.reference_confidences >= threshold).astype(int))
                + numpy.sum(numpy.logical_and(numpy.logical_and(numpy.logical_not(self.reference_errors), self.reference_confidences < threshold), numpy.logical_and(self.test_adversarial_errors, self.test_adversarial_confidences >= threshold))))
    if denominator > 0:
        return numerator / float(denominator)
    return 0
  1. The numerator computes the clean test errors and robust test errors, subject to the `threshold`, separately and adds them together.
  2. The denominator also comprises two terms: the number of clean examples exceeding the threshold, and the number of examples that obtain low confidence but where the corresponding adversarial example exceeds the threshold.

In practice, 10% of the test examples are used as a validation set to pick the threshold. Note that a confidence-thresholded (clean) test error can be computed analogously to Listing 2.


Table 1: Evaluation results using the standard and confidence-thresholded robust test error, at 99% TPR, comparing AutoAttack (AA), AA maximizing confidence (AA-Conf) and PGD maximizing confidence (PGD-Conf). Both models (AT and CCAT) are WRN-28-10 architectures trained on CIFAR10.

Attack | AT, $\tau = 0$ | AT, 99% TPR | CCAT, $\tau = 0$ | CCAT, 99% TPR
$L_\infty$ AA | 47.9 | 42.9 | 99.5 | 5.7
$L_\infty$ AA-Conf | – | 43.1 | 100 | 9.5
$L_\infty$ PGD-Conf | – | 43.7 | 94.2 | 50.8
$L_2$ PGD-Conf | 67.7 | 67.4 | 90 | 44.2

In Table 1, I summarize several robustness results, considering $L_\infty$ attacks with $\epsilon = 0.03$ and $L_2$ attacks with $\epsilon = 1$ on CIFAR10. I want to start with the adversarially trained model [] from this previous article, denoted AT. All models are WRN-28-10 architectures, and you can see that the AT model obtains 47.9% unthresholded robust test error, as in the previous article. Allowing AT to benefit from confidence thresholding, this can be reduced to 42.9%. However, this can be misleading since AA [] does not explicitly maximize confidence. Thus, it is not the ideal attack for evaluating confidence-thresholded robust test error. Indeed, maximizing confidence using AA or standard PGD increases the confidence-thresholded robust test error to 43.1% or 43.7%, respectively. This also shows that AA is not optimal. Generally, the confidence-thresholded robust test error will always be slightly lower than the standard variant, at least for standard AT.

Confidence-calibrated adversarial training (CCAT) generally obtains very high standard robust test errors, such as 99.5% against AA. However, after thresholding, this is reduced to 5.7%, once more showing that standard AA is a poor choice for evaluating CCAT. PGD maximizing confidence, in contrast, increases the confidence-thresholded robust test error to 50.8%. Thereby, CCAT performs slightly worse than standard AT. However, the picture changes when evaluating against $L_2$ attacks that both models have not seen during training. Here, CCAT outperforms AT with 44.2% robust test error vs. 67.4%. I also tested the more recently proposed adaptive AutoAttack [], where the model from Table 1 achieves 54.4% robust test error for $\epsilon = 0.03$ and $L_\infty$ adversarial examples; note that the results in [] use $\epsilon \approx 0.031$.


In summary, for evaluating confidence-calibrated adversarial training, it is crucial to consider the confidence of adversarial examples. In this article, I introduced a confidence-thresholded robust test error that is fully comparable to the standard one. The confidence threshold is controlled by enforcing a specific true positive rate (TPR) on clean examples, for example, 99%. This formulation allows models to obtain robustness in two ways: correctly classifying adversarial examples, or assigning low confidence to them in order to reject them. In the next article, I will show that this also makes it possible to reject out-of-distribution examples, so-called distal adversarial examples.

  • [] David Stutz, Matthias Hein, Bernt Schiele. Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks. ICML 2020: 9155-9166
  • [] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR (Poster) 2018.
  • [] Francesco Croce, Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. ICML 2020: 2206-2216
  • [] Ye Liu, Yaya Cheng, Lianli Gao, Xianglong Liu, Qilong Zhang, Jingkuan Song. Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack. CVPR 2022: 15084-15093
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.