Papernot et al. build upon the idea of network distillation and propose a simple mechanism to defend networks against adversarial attacks. The main idea of distillation, originally introduced to “distill” the knowledge of very deep networks into smaller ones, is to train a second, possibly smaller network using the probability distributions predicted by the original, possibly larger network as supervision. Papernot et al., as well as the authors of the original distillation work, argue that these probability distributions, i.e., the activations of the final softmax layer (also referred to as “soft” labels), contain richer information about the task than the true “hard” labels. This allows the second network to achieve similar performance while using fewer parameters or a different architecture.
However, Papernot et al. do not distill a network's knowledge into a smaller one; instead, they use distillation to make networks robust against adversarial attacks. They argue that most algorithms for generating adversarial examples make use of the “adversarial gradient”, i.e., the gradient of the network's cost w.r.t. its input. The adversarial gradient then guides the perturbation of the input image in the direction of wrong classes (the authors consider a classification task for simplicity). Therefore, Papernot et al. argue, the gradient around training samples needs to be reduced; in other words, the model needs to be smoothed.
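To make the notion of an adversarial gradient concrete, here is a minimal sketch in plain Python. It assumes a toy linear softmax classifier and a sign-based perturbation step; the weights `W`, the step size `eps`, and all function names are hypothetical choices for illustration and are not taken from the paper, which may evaluate different attacks.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_linear(W, b, x):
    # Toy model (assumption): logits z = W x + b.
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def loss(W, b, x, label):
    # Cross-entropy cost of the true label.
    return -math.log(softmax(logits_linear(W, b, x))[label])

def predict(W, b, x):
    z = logits_linear(W, b, x)
    return z.index(max(z))

def input_gradient(W, b, x, label):
    # Gradient of the cross-entropy cost w.r.t. the input x.
    # For a linear softmax model: dL/dz_k = p_k - 1[k == label], dL/dx = W^T dL/dz.
    p = softmax(logits_linear(W, b, x))
    dz = [p_k - (1.0 if k == label else 0.0) for k, p_k in enumerate(p)]
    return [sum(W[k][j] * dz[k] for k in range(len(W))) for j in range(len(x))]

def sign_step(x, grad, eps):
    # Move each input component by eps in the direction that increases the cost.
    return [x_j + eps * ((g > 0) - (g < 0)) for x_j, g in zip(x, grad)]

# Hypothetical two-class example.
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x = [0.6, 0.4]
label = 0

grad = input_gradient(W, b, x, label)
x_adv = sign_step(x, grad, eps=0.2)
```

In this toy setting, the perturbed input `x_adv` has a higher cost than `x` and is assigned to the other class. The smaller the input gradient around `x`, the less such a step can change the prediction, which is the motivation for smoothing the model.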
Figure 1: Illustration of the proposed 2-stage training of smoothed and more robust networks. Also see the paper for details.
The proposed approach is very simple: they distill the knowledge of the network into another network with the same architecture and hyperparameters. Training on the probability distributions as “soft” labels instead of the hard labels essentially smooths the network. The full procedure is illustrated in Figure 1.
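The role of the soft labels in the second training stage can be sketched as follows. The softmax temperature `T` is an assumption on my side: it is commonly used in distillation to control how soft the labels are, but it is not discussed in the summary above. The logits and all function names are hypothetical, and the actual training loops of both stages are omitted.

```python
import math

def softmax_t(logits, T=1.0):
    # Softmax with temperature T (assumption); larger T gives softer distributions.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, soft_labels, T=1.0):
    # Cross-entropy between the first network's soft labels and the
    # second network's prediction at the same temperature.
    q = softmax_t(student_logits, T)
    return -sum(p * math.log(q_k) for p, q_k in zip(soft_labels, q))

# Stage 1 (not shown): train the first network on hard labels.
# Hypothetical logits of that network for one training sample:
first_logits = [4.0, 1.0, 0.0]

hard_label = [1.0, 0.0, 0.0]                  # label used in stage 1
soft_T1 = softmax_t(first_logits, T=1.0)      # soft label at T = 1
soft_T20 = softmax_t(first_logits, T=20.0)    # much softer label at T = 20

# Stage 2: the second network (same architecture) is trained to match the
# soft labels; this is the per-sample cost it would minimize.
loss_match = soft_cross_entropy(first_logits, soft_T20, T=20.0)
loss_other = soft_cross_entropy([0.0, 4.0, 1.0], soft_T20, T=20.0)
```

Unlike the hard label, the soft label assigns non-zero probability to the wrong classes, so the second network is also told how similar the classes are. The cost is lowest when the second network reproduces the first network's distribution, which is why `loss_match` is smaller than `loss_other`.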
Despite the simplicity of the approach, I want to highlight some additional key observations: