Morcos et al. study the influence of ablating single units as a proxy to generalization performance. On Cifar10, for example, a 11-layer convolutional network is trained on the clean dataset, as well as on versions of Cifar10 where a fraction of $p$ samples have corrupted labels. In the latter cases, the network is forced to memorize examples, as there is no inherent structure in the labels assignment. Then, it is experimentally shown that these memorizing networks are less robust to setting whole feature maps to zero, i.e., ablating them. This is shown in Figure 1. Based on this result, the authors argue that the area under this ablation curve (AUC) can be used as proxy for generalization performance. For example, early stopping or hyper-parameter selection can be done based on this AUC value. Furthermore, they show that batch normalization discourages networks to rely on these so-called single-directions, i.e., single units or feature maps. Specifically, batch normalization seems to favor units holding information about multiple classes/concepts.
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: