23rd June 2020

Etai Littwin, Lior Wolf. *Regularizing by the Variance of the Activations' Sample-Variances.* NeurIPS 2018.

Also find this summary on ShortScience.org.

Littwin and Wolf propose an activation variance regularizer that is shown to have a similar, or even better, effect than batch normalization. The regularizer is based on an analysis of the variance of the activations' sample variances: the variance of these measured variances is low if the activation values come from a distribution with few modes. The regularizer is thus intended to encourage activation distributions with only a few modes. This is achieved using the regularizer

$\mathbb{E}[(1 - \frac{\sigma_s^2}{\sigma^2})^2]$

where $\sigma_s^2$ is the variance of the activation values measured on a sample $s$ and $\sigma^2$ is the true variance of the activation values. The estimate $\sigma_s^2$ is mostly influenced by the mini-batch used for training. In practice, since the true variance $\sigma^2$ is not available, the regularizer is replaced by

$(1 - \frac{\sigma_{s_1}^2}{\sigma_{s_2}^2 + \beta})^2$

which can be estimated on two different batches, $s_1$ and $s_2$, during training. Here, $\beta$ is a learnable parameter that mainly handles the case where the variance is close to zero. The authors also provide theoretical bounds and draw a connection to batch normalization, discussing in which cases and why the regularizer might be a better alternative. These claims are supported by experiments on CIFAR and Tiny ImageNet.
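To make the two-batch form concrete, here is a minimal NumPy sketch of the practical regularizer. It is not the authors' implementation: the function name `variance_ratio_penalty`, the averaging over units, the use of the unbiased variance estimate, and the fixed value of $\beta$ (learnable in the paper) are all assumptions made for illustration.

```python
import numpy as np


def variance_ratio_penalty(acts_s1, acts_s2, beta=1e-3):
    """Compute (1 - var_s1 / (var_s2 + beta))^2, averaged over units.

    acts_s1, acts_s2: arrays of shape (batch, units) holding one layer's
    activations on two different mini-batches s1 and s2.
    beta: kept as a fixed constant here (assumption); the summary
    describes it as a learnable parameter.
    """
    # Per-unit sample variances over the batch dimension (unbiased estimate).
    var_s1 = acts_s1.var(axis=0, ddof=1)
    var_s2 = acts_s2.var(axis=0, ddof=1)
    return float(np.mean((1.0 - var_s1 / (var_s2 + beta)) ** 2))


rng = np.random.default_rng(0)
n, units = 64, 256

# Two mini-batches of Gaussian activations.
g1, g2 = rng.normal(size=(2, n, units))

# Two mini-batches from a mixture with a rare second mode:
# roughly 10% of the values sit near 5, the rest near 0.
is_rare = rng.random(size=(2, n, units)) < 0.1
m1, m2 = np.where(is_rare, 5.0, 0.0) + 0.1 * rng.normal(size=(2, n, units))

penalty_gauss = variance_ratio_penalty(g1, g2)
penalty_mixture = variance_ratio_penalty(m1, m2)
print(f"Gaussian: {penalty_gauss:.4f}, rare-mode mixture: {penalty_mixture:.4f}")
```

In this toy setting the mixture is penalized more strongly than the Gaussian, because its sample variance fluctuates more from batch to batch. For i.i.d. samples and an unbiased variance estimate, the expectation in the first formula equals $\frac{\kappa}{n} - \frac{n-3}{n(n-1)}$, with batch size $n$ and kurtosis $\kappa$, so distributions with high kurtosis incur a larger penalty.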