David Alvarez-Melis, Tommi S. Jaakkola. Towards Robust Interpretability with Self-Explaining Neural Networks. NeurIPS, 2018.

Alvarez-Melis and Jaakkola propose three requirements for self-explainable models, explicitness, faithfulness and stability, and construct a self-explainable, generalized linear model optimizing for these properties. In particular, the proposed model has the form

$f(x) = \theta(x)^T h(x)$

where $\theta(x)$ are features (e.g., from a deep network) and $h(x)$ are interpretable features/concepts. In practice, these concepts are learned using an auto-encoder from the raw input while the latent code, which represents $h(x)$, is regularized to learn concept under weak supervision. Additionally, the classifier is regularized to be locally difference-bounded by the concept function $h(x)$. This means that for each point $x_0$ it holds

$\|f(x) – f(x_0)\| \leq L \|h(x) – h(x_0)\|$ for all $\|x – x_0\|_\delta$

for some $\delta$ and $L$. This condition leads to some stability of interpretations with respect to the concepts $h(x)$. In practice, this is enforced through a regularizer.

In experiments, the authors argue that this class of models has advantages regarding the following three properties of self-explainable models: explicitness, i.e., whether explanations are actually understandable, faithfulness, i.e. whether estimated importance of features reflects true relevance, and stability, i.e., robustness of interpretations against small perturbations. For some of these conditions, the authors propose quantitative metrics; robustness, for example, can be evaluated using

$\arg\max_{\|x’ - x\|\leq\epsilon} \frac{\|f(x) – f(x’)}{\|h(x) – h(x’)\|}$

which is very similar to practically evaluating adversarial robustness.

Also find this summary on ShortScience.org.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.