Etai Littwin, Lior Wolf. Regularizing by the Variance of the Activations' Sample-Variances. NeurIPS 2018.

Littwin and Wolf propose a activation variance regularizer that is shown to have a similar, even better, effect than batch normalization. The proposed regularizer is based on an analysis of the variance of activation values; the idea is that the measured variance of these variances is low if the activation values come from a distribution with few modes. Thus, the intention of the regularizer is to encourage distributions of activations with only few modes. This is achieved using the regularizers

$\mathbb{E}[(1 - \frac{\sigma_s^2}{\sigma^2})^2]$

where $\sigma_s^2$ is the measured variance of activation values and $\sigma^2$ is the true variance of activation values. The estimate $\sigma^2_s$ is mostly influenced by the mini-batch used for training. In practice, the regularizer is replaced by

$(1 - \frac{\sigma_{s_1}^2}{\sigma_{s_2}^2 + \beta})^2$

which can be estimated on two different batches, $s_1$ and $s_2$, during training and $\beta$ is a parameter that can be learned and mainly handles the case where the variance is close to zero. In the paper, the authors provide some theoretical bounds and also make a connection to batch normalization and in which cases and why the regularizer might be a better alternative. These claims are supported by experiments on Cifar and Tiny ImageNet.

Also find this summary on ShortScience.org.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.