Akhtar et al. propose a rectification and detection scheme as a defense against universal adversarial perturbations. Their overall approach is illustrated in Figure 1 and briefly summarized as follows. First, given a classifier with fixed weights, a rectification network (the so-called perturbation rectifying network, PRN) is trained to “undo” the perturbations. This network can be trained on a set of clean and perturbed images using the classifier’s loss. Second, based on the discrete cosine transform (DCT) of the difference between the original and the rectified image (computed for both clean and perturbed images), an SVM is trained to detect adversarially perturbed images. At test time, only images that have been identified as perturbed are rectified. In experiments, the authors show that this setup is able to defend against adversarial attacks and does not significantly affect the classifier’s accuracy.
Figure 1: The proposed perturbation rectifying network (PRN) and the corresponding perturbation detector.
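The test-time procedure summarized above can be sketched roughly as follows. This is only a minimal illustration, not the authors' implementation: the function names (`dct2`, `detection_features`, `defend`) are hypothetical, images are assumed to be small grayscale nested lists, the DCT is an unnormalized DCT-II, and the rectifier, detector and classifier are toy stand-ins for the trained PRN, the trained SVM and the fixed classifier.

```python
import math


def dct2(block):
    """Unnormalized 2D DCT-II of a 2D list of floats (naive, for small inputs)."""
    rows, cols = len(block), len(block[0])
    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for x in range(rows):
                for y in range(cols):
                    total += (
                        block[x][y]
                        * math.cos(math.pi * (2 * x + 1) * u / (2 * rows))
                        * math.cos(math.pi * (2 * y + 1) * v / (2 * cols))
                    )
            out[u][v] = total
    return out


def detection_features(image, rectified):
    """Flattened DCT coefficients of the difference (image - rectified)."""
    diff = [
        [a - b for a, b in zip(row_img, row_rec)]
        for row_img, row_rec in zip(image, rectified)
    ]
    return [coeff for row in dct2(diff) for coeff in row]


def defend(image, rectifier, detector, classifier):
    """Classify the rectified image only if the detector flags the input.

    Returns (prediction, flagged_as_perturbed).
    """
    rectified = rectifier(image)
    features = detection_features(image, rectified)
    if detector(features):
        return classifier(rectified), True
    return classifier(image), False


# Toy stand-ins (assumptions for illustration only):
# - the "rectifier" caps pixel values at 0.2, pretending clean images lie in [0, 0.2];
# - the "detector" thresholds the total DCT magnitude instead of using a trained SVM;
# - the "classifier" labels an image by its mean intensity.
def toy_rectifier(image):
    return [[min(p, 0.2) for p in row] for row in image]


def toy_detector(features):
    return sum(abs(f) for f in features) > 0.1


def toy_classifier(image):
    pixels = [p for row in image for p in row]
    return "bright" if sum(pixels) / len(pixels) > 0.3 else "dark"


clean = [[0.1, 0.2], [0.0, 0.1]]
perturbed = [[p + 0.5 for p in row] for row in clean]  # same offset on every pixel

clean_result = defend(clean, toy_rectifier, toy_detector, toy_classifier)
perturbed_result = defend(perturbed, toy_rectifier, toy_detector, toy_classifier)
```

In this toy setting the clean image passes through unchanged and is not flagged, while the perturbed image is flagged and classified after rectification, so both end up labeled "dark" even though the unprotected toy classifier would label the perturbed image "bright". In the actual scheme, the detector would be an SVM fitted on such DCT features from clean and perturbed training images.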
Overall, the proposed approach is comparable to other work that tries either to detect adversarial perturbations or to remove them from the test image. One advantage is that the classifier itself does not need to be re-trained. However, as the rectification network is itself a (convolutional) neural network and the detector is an SVM, both are also potential targets of attacks, although attacking the whole system might be more challenging (especially when crafting universal perturbations).