The topic of robust deep learning is receiving considerable attention in the last few years. The observation of adversarial examples was first reported in . In general, it describes the existence of small perturbations — unrecognizable to humans — of testing samples that result in a mis-classification, see Figure 1.
In this article, I want to provide a rough overview and survey of the literature concerned with constructing adversarial examples, defending against them as well as the theory behind their existence. Additionally, I includes some thoughts on future research directions. The article is, however, not intended to be complete; every week, new papers on adversarial examples are uploaded to ArXiv such that it is impossible to keep this article perfectly up-to-date. However, this article will cover the most important papers until Februrary 2018.
Figure 1 (click to enlarge): Illustration of an adversarial example taken from . The original image is "attacked" using the adversarial perturbation shown in the middle; as a result, the classifier mis-classifies the example although the change is ‐ for human eyes ‐ not perceivable.
Detailed comments on individual papers can also be found in my Reading Notes.
- (1) Original Paper ;
- (2) Attacks [2, 3, 4];
- (I) White-Box/Gray-Box:
- (A) FGSM ;
- (B) PGD ;
- (C) CW ;
- (D) Universal ;
- (E) Transferability/Studies [9, 10, 6];
- (F) Physical [11, 12];
- (G) Translation/Rotation ;
- (H) Adversarial Saliency Maps ;
- (I) Logits Attack (LOTS) ;
- (II) Black-Box:
- (A) ZOO ;
- (B) Boundary Attack ;
- (C) One-Pixel Attack ;
- (I) White-Box/Gray-Box:
- (3) Membership Inference Attacks ;
- (4) Poisoning Attacks ;
- (5) Whitening/Reverse Engineering ;
- (6) Defenses:
- (I) Ensembles [22, 23, 24, 25, 4];
- (II) Adversarial Training (and Interpretations/Variants) [26, 27, 28, 29, 30, 31];
- (A) Robust Optimization ;
- (B) Distributional Robust Optimization ;
- (C) Generative Adversarial Training ;
- (D) Convex Outer Adversarial Polytope ;
- (III) Bounded ReLU ;
- (IV) Feature Squeezing ;
- (V) PCA ;
- (VI) Saturating Networks ;
- (VII) Distillation ;
- (VIII) Iterative GAN/Deep Image Prior Projection [36, 37];
- (IX) (Random) Image Transformations [38, 39, 40];
- (X) Regularization [41, 42, 43, 44];
- (XI) Discretization ;
- (XII) Adaptive JPEG quantization ;
- (XIII) Rectification/Detection ;
- (XIV) Out-Distribution Training ;
- (XV) Logit Pairing ;
- (7) Detection and Avoidance [50, 51, 52, 53, 55, 56, 57];
- (8) Theory/Interpretations:
- (I) Linear Explanation ;
- (II) Lipschitz(-Inspired) [42, 58, 59, 43];
- (III) Upper Risk Bound ;
- (IV) Boundary Tilting ;
- (V) Semi-Random Noise ;
- (VI) Manifold Explanation [61, 36, 63, 64] (also see );
- (VII) Intrinsic Dimensionality [55, 56];
- (VIII) Loss Gradient Increases with Input Dimensionality ;
- (IX) Generalization Explanation;
- (X) Accuracy-Robustness Trade-Off ;
- (I) Semantic Segmentation [74, 75];
- (II) Object Detection ;
- (III) Generative Models [76, 77];
- (IV) Reinforcement Learning [78, 79];
- (V) Robot Vision/iCub ;
- (VI) Visual Question Answering ;
- (VII) Face Identification ;
Recent literature can roughly be divided into five lines of work: attacks, defenses, detection, theory and applications. Most published work is considering either attacks or defenses/detection — and, currently, it looks like the literature on defending against or detecting adversarial examples is "loosing the battle". In contrast, only few papers are considering the theoretical background of why and where adversarial examples exist, or are able to give guarantees on robustness. Finally, adversarial examples have been shown to exist for a variety of tasks including object detection, semantic segmentation, image generation or reinforcement learning to name just a few.
First, several attacks have been proposed [5, 6, 7, 8] — mostly based on a white-box scenario where the attacker has access to the full model including gradients. However, black-box attacks have also been considered, e.g. using zero order optimization . Interesting attack scenarios are physical attacks, usually evaluated by printing adversarial examples [11, 12]. Some works also try to draw the line between randomness and adverseness  or craft adversarial geometric transformations  which essentially asks the question where generalization starts and robustness ends. Overall, most of the white-box attacks are first-order gradient-based methods that usually try to directly maximize the classifiers loss in the vicinity of a testing sample — or use some surrogate loss instead. Thus, Madry et al.  use projected gradient descent as the strongest first-order gradient-based adversary.
Second, several defenses against proposed the attacks have been developed [22, 23, 24, 25, 4, 26, 27, 28, 29, 29, 26, 33, 34, 35]. While attacks are comparably easy to benchmark — even in the light of existing defense mechanisms —, defenses are relatively hard to develop and study. This is the essence of any security problem: the adversary has to find at least one adversarial example; the defender, in contrast, has to make sure that no such example can be found at all — a considerably harder task. As a result, some defense mechanisms have been proven wrong. Generally, defense mechanisms include adversarial training (i.e. training on adversarial examples) and its variants [26, 27, 28, 29], ensemble training [22, 23, 24, 25, 4] adaptations of the architecture (e.g. saturating networks , bounded ReLUs  etc.), pre-processing [34, 33] or distillation . However, there have been studies questioning (in terms of experimental evaluation) the effectiveness of distillation , adversarial training [10, 4] and saturating networks ; for other defense mechanisms, independent studies of their effectiveness are missing.
Third, several mechanisms to efficiently detect and avoid adversarial examples have been published [51, 52, 53]. However, their effectiveness has also been questioned in the literature . After all, these detection schemes are also machine learning systems that try to classify adversarial examples; as such, in a white-box scenario, the idea of developing systems for detecting and subsequently avoiding adversarial examples seems to be doomed from the start.
Fourth, there has been some work on theoretical guarantees and bounds regarding robustness against adversarial examples. An important question was foreshadowed by Szegedy et al.  and explicitly posed by Goodfellow et al. : why do adversarial examples exist? Goodfellow et al. provide the "linear explanation" arguing that dot-products (as well as convolutions) are susceptible to adversarial noise due to the summation operation, which amplify small per-pixel noise in high dimensions. Tanay et al. , however, argue that the "linear explanation" is not convincing and provide an alternative "boundary tilting" perspective — also referred to as the "manifold assumption". In their paper, the data is assumed to lie on a sub-manifold; while the classifier might be robust on this manifold, i.e. for most samples it is hard to find a decision boundary within a small vicinity, robustness is lost when leaving the manifold. In a slightly different direction [42, 60], upper bounds on the robustness are given; Hein et al.  present an argument based on local Lipschitz continuity, while Fawzi et al.  present an upper bound based on generalization risk and the data distributions. In , Fawzi further provide an interesting perspective on the transition between random noise and adversarial examples; and in , Bastani et al. provide theoretical metrics for measuring robustness.
Finally, several works [74, 75, 11, 76, 77, 78, 79] address applications. These include structured problems such as semantic segmentation and object detection as well as reinforcement learning or generative models. Further applications such as face recognition or tasks in natural language processing are discussed in surveys [71, 72]. Physical adversarial examples [11, 12] could generally also be viewed as interesting applications.
On a final note, there are some papers (experimentally) studying properties of adversarial examples and their relationship to deep neural network properties. An important property is transferability, which is also the basis of many black-box attacks. Similarly, an important question is whether network architectures or complexity influence transferability as well as robustness to adversarial examples. Some works [9, 10] try to study these problems in more detail; also focussing on the relationship between capacity and robustness/tranferability. Similar to transferability, it has been shown that adversarial examples might be universal, i.e., image-agnostic .
Personally, I found that there are several aspects regarding adversarial examples and robustness that are not fully understood and explored yet. This includes a common understanding of relevant threat models (e.g., white-box, gray-box, black-box as well as attacks at training time), consistent evaluation methodologies (especially for defenses) as well as theoretical guarantees. In the following, I want to provide some thoughts on possible future research directions.
First, there is no clear notion of relevant threat models. Only some works [7, 34, 16, 52], especially from the security and privacy community, try to specify the knowledge and abilities of the adversary. Especially  and  try to make the abilities of the adversary in white- and black-box scenarios explicit. In the computer vision and machine learning communities, in contrast, the threat model is usually defined implicitly through the attacks used in evaluation — commonly limited to first-order gradient-based adversaries. These attacks usually correspond to adversaries with limited knowledge: the adversary has access to the model, its parameters, outputs and gradients (at test time) and knows about employed defenses. Other notions of gray-box adversaries, adversaries operating at training time, or more powerful adversaries in terms of higher-order optimization techniques are rarely considered.
Second, given relevant threat models, an agreed-upon benchmark specifying datasets (including splits), attacks and defenses as well as evaluation metrics is missing. For proposed defenses for example, there is no common and clearly defined metric of "robustness". While  proposes several theoretical measures of robustness, their computation in practice is cumbersome. Additionally, the effectiveness of attacks against a wide variety of defense mechanisms, training protocols and network architectures is missing (although some hints are given in [9, 10]). The same holds for most defenses; often, it is unclear how effective these defense are in practice — some attacks have already been shown to be less effective as originally advocated [7, 10, 4, 17].
Third, there is a significant gap between theoretical insights regarding robustness and attacks/defenses studied in practice. Attacks are generally based on iterative gradient ascent and can, as such, be studied under the umbrella of optimization. Similarly, adversarial training is commonly studied in terms of robust optimization; although other interpretations [29, 27, 28] have also been proposed. Recently, some papers also try to provide robustness guarantees (e.g., [42, 30, 59]) or trade-offs (e.g., robustness-generalization trade-off [63,66]). The experimental or theoretical settings are, however, usually constraint to simple models and datasets — and the conclusions might be contradicting [63,66].
Finally, I want to conclude with some problems that I couldn't find addressed in the literature. For example, perfect knowledge adversaries with access to the training procedure including training data (partly known as poisoning attacks ). Or problems of authenticated or verifiable use of deep learning including the generation of "proofs" for predictions of rejecting input. Another interesting class of problems include model theft or whitening/re-engineering (e.g., ).
-  Christian Szegedy et al. Intriguing properties of neural networks. In: arXiv.org abs/1312.6199 (2013).
-  Nina Narodytska and Shiva Prasad Kasiviswanathan. Simple Black-Box Adversarial Attacks on Deep Neural Networks. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops. 2017, pp. 1310—1318.
-  Nicolas Papernot et al. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In: Proc. of the IEEE Symposium on Security and Privacy. 2016, pp. 582—597.
-  Florian Tramer et al. Ensemble Adversarial Training: Attacks and Defenses. In: arXiv.org abs/1705.07204 (2017).
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In: arXiv.org abs/1412.6572 (2014).
-  Aleksander Madry et al. Towards deep learning models resistant to adversarial attacks. In: arXiv.org abs/1706.06083 (2017).
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In: Proc. of the IEEE Symposium on Security and Privacy. IEEE. 2017, pp. 39—57.
-  Seyed-Mohsen Moosavi-Dezfooli et al. Universal adversarial perturbations. In: arXiv.org abs/1610.08401 (2016).
-  Yanpei Liu et al. Delving into transferable adversarial examples and black-box attacks. In: arXiv.org abs/1611.02770 (2016).
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In: arXiv.org abs/1611.01236 (2016).
-  Jiajun Lu et al. No need to worry about adversarial examples in object detection in autonomous vehicles. In: arXiv.org abs/1707.03501 (2017).
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In: arXiv.org abs/1607.02533 (2016).
-  Logan Engstrom et al. A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations. In: arXiv.org abs/1712.02779 (2017).
-  Nicolas Papernot et al. The Limitations of Deep Learning in Adversarial Settings. In: Proc. of the IEEE Symposium on Security and Privacy. 2016, pp. 372—387.
-  Andras Rozsa, Manuel Gunther, and Terrance E. Boult. Adversarial Robustness: Softmax versus Openmax. In: arXiv.org abs/1708.01697 (2017).
-  Pin-Yu Chen et al. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proc. of the ACM Workshop on Artificial Intelligence and Security. ACM. 2017, pp. 15—26.
-  Wieland Brendel and Matthias Bethge. Comment on Biologically inspired protection of deep networks from adversarial attacks. In: arXiv.org abs/1704.01547 (2017).
-  Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. In: arXiv.org abs/1710.08864 (2017).
-  Reza Shokri et al. Membership Inference Attacks Against Machine Learning Models. In: Proc. of the IEEE Symposium on Security and Privacy. 2017, pp. 3—18.
-  Battista Biggio, Blaine Nelson, and avel Laskov. Poisoning Attacks against Support Vector Machines. In: Proc. of the International Conf. on Machine learning (ICML). 2012.
-  Seong Joon Oh et al. Whitening Black-Box Neural Networks. In: arXiv.org abs/1711.01768 (2017).
-  Xuanqing Liu et al. Towards Robust Neural Networks via Random Self-ensemble. In: arXiv.org abs/1712.00673 (2017).
-  Thilo Strauss et al. Ensemble methods as a defense to adversarial perturbations against deep neural networks. In: arXiv.org abs/1709.03423 (2017).
-  Tom Zahavy et al. Ensemble Robustness and Generalization of Stochastic Deep Learning Algorithms. In: arXiv.org abs/1602.02389 (2016).
-  Warren He et al. Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong. In: arXiv.org abs/1706.04701 (2017).
-  Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient defenses against adversarial attacks. In: Proc. of the ACM Workshop on Artificial Intelligence and Security. ACM. 2017, pp. 39—49.
-  Takeru Miyato et al. Distributional smoothing with virtual adversarial training. In: arXiv.org abs/1507.00677 (2015).
-  Ruitong Huang et al. Learning with a strong adversary. In: arXiv.org abs/1511.03034 (2015).
-  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. In: arXiv.org abs/1511.05432 (2015).
-  Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifiable Distributional Robustness with Principled Adversarial Training. In: arXiv.org abs/1710.10571 (2017).
-  Hyeungill Lee, Sungyeob Han, and Jungwoo Lee. Generative Adversarial Trainer: Defense to Adversarial Perturbations with GAN. In: arXiv.org abs/1705.03387 (2017).
-  J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. In: arXiv.org abs/1711.00851 (2017).
-  Weilin Xu, David Evans, and Yanjun Qi. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. In: arXiv.org abs/1704.01155 (2017).
-  Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Dimensionality Reduction as a Defense against Evasion Attacks on Machine Learning Classifiers. In: arXiv.org abs/1704.02654 (2017).
-  Aran Nayebi and Surya Ganguli. Biologically inspired protection of deep networks from adversarial attacks. In: arXiv.org abs/1703.09202 (2017).
-  Andrew Ilyas et al. The Robust Manifold Defense: Adversarial Training using Generative Models. In: arXiv.org abs/1712.09196 (2017).
-  Rama Chellappa Pouya Samangouei Maya Kabkab. Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models. In: Proc. of the International Conf. on Learning Representations (ICLR) (2018).
-  Aaditya Prakash et al. Deflecting Adversarial Attacks with Pixel Deflection. In: arXiv.org abs/1801.08926 (2018).
-  Chuan Guo et al. Countering Adversarial Images using Input Transformations. In: arXiv.org abs/1711.00117 (2017).
-  Cihang Xie et al. Mitigating adversarial effects through randomization. In: arXiv.org abs/1711.01991 (2017).
-  Carl-Johann Simon-Gabriel et al. Adversarial Vulnerability of Neural Networks Increases With Input Dimension. In: arXiv.org abs/1802.01421 (2018).
-  Matthias Hein and Maksym Andriushchenko. Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. In: arXiv.org abs/1705.08475 (2017).
-  Daniel Jakubovitz and Raja Giryes. Improving DNN Robustness to Adversarial Attacks using Jacobian Regularization. In: arXiv.org abs/1803.08680 (2018).
-  Andrew Slavin Ross and Finale Doshi-Velez. Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients. In: arXiv.org abs/1711.09404 (2017).
-  Jacob Buckman et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples. In: Proc. of the International Conf. on Learning Representations (ICLR). 2018.
-  Aaditya Prakash et al. Protecting JPEG Images Against Adversarial Attacks. In: arXiv.org abs/1803.00940 (2018).
-  Naveed Akhtar, Jian Liu, and Ajmal S. Mian. Defense against Universal Adversarial Perturbations. In: arXiv.org abs/1711.05929 (2017).
-  Mahdieh Abbasi and Christian Gagne. Out-distribution training confers robustness to deep neural networks. In: arXiv.org abs/1802.07124 (2018).
-  Harini Kannan andAlexey Kurakin andIan J. Goodfellow. Adversarial Logit Pairing. In: arXiv.org abs/1803.06373 (2018).
-  Bita Darvish Rouhani et al. Towards Safe Deep Learning: Unsupervised Defense Against Generic Adversarial Attacks. 2018. URL: https://openreview.net/forum?id=HyI6s40a-.
-  Kathrin Grosse et al. On the (statistical) detection of adversarial examples. In: arXiv.org abs/1702.06280 (2017).
-  Nicholas Carlini and David Wagner. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In: arXiv.org abs/1705.07263 (2017).
-  Reuben Feinman et al. Detecting Adversarial Samples from Artifacts. In: arXiv.org abs/1703.00410 (2017).
-  Xingjun Ma et al. Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality. In: arXiv.org abs/1801.02613 (2018).
-  Laurent Amsaleg et al. The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality. In: Proc. of the IEEE Workshop on Information Forensics and Security. 2017, pp. 1—6.
-  Jan Hendrik Metzen et al. On Detecting Adversarial Perturbations. In: arXiv.org abs/1702.04267 (2017).
-  Moustapha Cisse et al. Parseval Networks: Improving Robustness to Adversarial Examples. In: Proc. of the International Conf. on Machine learning (ICML). 2017, pp. 854—863.
-  Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified Defenses against Adversarial Examples. In: arXiv.org abs/1801.09344 (2018).
-  Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Fundamental limits on adversarial robustness. In: Proc. of the International Conf. on Machine learning (ICML) Workshops. 2015.
-  Thomas Tanay and Lewis Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. In: arXiv.org abs/1608.07690 (2016).
-  Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In: Advances in Neural Information Processing Systems (NIPS). 2016, pp. 1632—1640.
-  Justin Gilmer et al. Adversarial Spheres. In: arXiv.org abs/1801.02774 (2018).
-  Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Deep Image Prior. In: arXiv.org abs/1711.10925 (2017).
-  Ronen Basri and David W. Jacobs. Efficient Representation of Low-Dimensional Manifolds using Deep Networks. In: arXiv.org abs/1602.04723 (2016).
-  Dimitris Tsipras et al. There Is No Free Lunch In Adversarial Robustness (But There Are Unexpected Benefits). In: arXiv.org abs/1805.12152 (2018).
-  Osbert Bastani et al. Measuring neural net robustness with constraints. In: Advances in Neural Information Processing Systems (NIPS). 2016, pp. 2613—2621.
-  Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 427—436.
-  Andras Rozsa, Ethan M. Rudd, and Terrance E. Boult. Adversarial Diversity and Hard Positive Generation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops. 2016, pp. 410—417.
-  Marco Barreno et al. Can machine learning be secure? In: Proc. of the ACM Symposium on Information, Computer and Communications Security. 2006, pp. 16—25.
-  Xiaoyong Yuan et al. Adversarial Examples: Attacks and Defenses for Deep Learning. In: arXiv.org abs/1712.07107 (2017).
-  Naveed Akhtar and Ajmal Mian. Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. In: arXiv.org abs/1801.00553 (2018).
-  Battista Biggio and Fabio Roli. Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning. In: arXiv.org abs/1712.03141 (2018).
-  Volker Fischer et al. Adversarial Examples for Semantic Image Segmentation. In: arXiv.org abs/1703.01101 (2017).
-  Moustapha M Cisse et al. Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples. In: Advances in Neural Information Processing Systems (NIPS). 2017, pp. 6980—6990.
-  Pedro Tabacof, Julia Tavares, and Eduardo Valle. Adversarial Images for Variational Autoencoders. In: arXiv.org abs/1612.00155 (2016).
-  Jernej Kos, Ian Fischer, and Dawn Song. Adversarial examples for generative models. In: arXiv.org abs/1702.06832 (2017).
-  Sandy H. Huang et al. Adversarial Attacks on Neural Network Policies. In: arXiv.org abs/1702.02284 (2017).
-  Yen-Chen Lin et al. Tactics of Adversarial Attack on Deep Reinforcement Learning Agents. In: Proc. of the International Joint Conf. on Artificial Intelligence (IJCAI). 2017, pp. 3756—3762.
-  Marco Melis et al. Is Deep Learning Safe for Robot Vision? Adversarial Examples Against the iCub Humanoid. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) Workshops. 2017, pp. 751—759.
-  Hongge Chen et al. Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning. In: arXiv.org abs/1712.02051 (2017).
-  Andras Rozsa, Manuel Gunther, and Terrance E. Boult. LOTS about attacking deep features. In: Proc. of the IEEE International Joint Conference on Biometrics, IJCB 2017. 2017, pp. 168—176.
-  Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models. In: arXiv.org abs/1707.04131 (2017).
-  Christian Szegedy et al. Intriguing properties of neural networks. In: Proc. of the International Conf. on Learning Representations (ICLR). 2014.
-  Olga Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. In: arXiv.org 1409.0575 (2014).