L. J. Ba, R. Kiros, G. E. Hinton. Layer Normalization. CoRR, 2016.

Ba et al. introduce layer normalization, a simplification of batch normalization intended to make normalization applicable to recurrent neural networks. They also demonstrate improved convergence for feed-forward networks, but do not discuss convolutional neural networks.

Batch normalization, given the activations $a_i^l$ of a specific layer $l$ (before feeding them into a non-linearity), normalizes them by mean and variance:

$\bar{a}_i^l = \frac{a_i^l - \mu_i^l}{\sigma_i^l}$, $\mu_i^l = \mathbb{E}[a_i^l]$, $\sigma_i^l = \sqrt{\mathbb{E}[(a_i^l - \mu_i^l)^2]}$

where, for simplicity, the expectation is taken over the current mini-batch. The main reason Ba et al. propose a slightly different scheme is that batch normalization is hard to apply to recurrent neural networks (see Section 3 in the paper for related work on adapting batch normalization to recurrent neural networks). Therefore, they propose a simpler normalization scheme that does not depend on the current mini-batch:

$\mu^l = \frac{1}{H}\sum_{i = 1}^H a_i^l$, $\sigma^l = \sqrt{\frac{1}{H} \sum_{i = 1}^H (a_i^l - \mu^l)^2}$

Here, $H$ is the number of activations in the layer. Simply put, the activations are normalized by their mean and variance within the current (single) training or test sample. This scheme is applicable to recurrent neural networks and also easier to apply at test time. The corresponding equations for recurrent neural networks can be found in the paper.
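The difference between the two schemes comes down to the axis over which the statistics are computed. A minimal numpy sketch of both normalizations (omitting the learnable gain and bias parameters from the paper; function names and the epsilon for numerical stability are my own):

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    # Layer normalization: statistics over the H activations of each sample.
    mu = a.mean(axis=1, keepdims=True)    # per-sample mean, shape (N, 1)
    sigma = a.std(axis=1, keepdims=True)  # per-sample standard deviation
    return (a - mu) / (sigma + eps)

def batch_norm(a, eps=1e-5):
    # Batch normalization: statistics over the N samples of the mini-batch.
    mu = a.mean(axis=0, keepdims=True)    # per-activation mean, shape (1, H)
    sigma = a.std(axis=0, keepdims=True)  # per-activation standard deviation
    return (a - mu) / (sigma + eps)

# Mini-batch of N = 128 samples with H = 1000 activations each:
a = np.random.randn(128, 1000)
a_ln = layer_norm(a)  # each row now has mean ~0 and std ~1
a_bn = batch_norm(a)  # each column now has mean ~0 and std ~1
```

Note that `layer_norm` is well-defined even for a single sample (`N = 1`), which is why no running statistics are needed at test time, whereas `batch_norm` degenerates for small batches.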

Experimental results on MNIST show superior convergence for models using layer normalization compared to batch normalization. However, the model is a fully connected (i.e. not convolutional) model of size 784-1000-1000-10. The results can be found in Figure 1. For convolutional neural networks, they state that batch normalization outperforms layer normalization. Unfortunately, they do not give quantitative results. It also remains unclear how layer normalization should be extended to convolutional layers, as batch normalization additionally normalizes across the spatial domain, which might have a significant impact.

Figure 1 (click to enlarge): Training results for a fully connected model of size 784-1000-1000-10 for batch sizes 128 (left) and 4 (right). In both cases the negative log likelihood and the test error are shown.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.