Ghorbani et al. show that neural network visualization techniques, often introduced to improve interpretability, are susceptible to adversarial examples. For example, they consider common feature-importance visualization techniques and aim to find an advesarial example that does not change the predicted label but the original interpretation – e.g., as measured on some of the most important features. Examples of the so-called top-1000 attack where the 1000 most important features are changed during the attack are shown in Figure 1. The general finding, i.e., that interpretations are not robust or reliable, is definitely of relevance for the general acceptance and security of deep learning systems in practice.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: