Kannan et al. propose a defense against adversarial examples called adversarial logit pairing where the logits of clean and adversarial example are regularized to be similar. In particular, during adversarial training, they add a regularizer of the form:
$\lambda L(f(x), f(x’))$
were $L$ is, for example, the $L_2$ norm and $f(x’)$ the logits corresponding to adversarial example $x’$ (corresponding to clean example $x$). Intuitively, this is a very simple approach – adversarial training itself enforces the classification results of clean and corresponding adversarial examples to be the same and adversarial logit pairing enforces the internal representation, i.e., the logits, to be similar. In theory, this could also be applied to any set of activations within the network. In the paper, they conclude that
“We hypothesize that adversarial logit pairing works well because it provides an additional prior that regularizes the model toward a more accurate understanding of the classes.”
In experiments, they show that this approach slightly outperforms adversarial training alone on SVHN, MNIST as well as ImageNet.