S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.

Update. A simple implementation of batch normalization in TensorFlow can be found in this snippet.

Ioffe and Szegedy present batch normalization, a layer that normalizes the input distribution of individual layers in deep networks in order to improve training and performance. The motivation can be stated as follows: in machine learning, it is well known that normalization and whitening, which make sure that the input distribution is approximately the same during training and testing, improve training and performance. Ioffe and Szegedy argue that this also holds for individual subnetworks (and layers) within deep networks. However, in deep networks, the input distribution of a specific subnetwork changes during training because the parameters of "lower" subnetworks change; this is the internal covariate shift referred to in the title. To counteract these changes, they propose to normalize the output of subnetworks over the current batch (in case of mini-batch training).

The normalization scheme

$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\text{Var}[x^{(k)}]}}$

normalizes each dimension $x^{(k)}$ of the output of a fully connected layer by the mean and standard deviation over the current mini-batch. Note that this approach includes two simplifications: normalizing over the whole training set is prohibitive, so mini-batches are used instead; and decorrelation is computationally expensive, so all variables are treated independently. The overall scheme is summarized in Algorithm 1, which additionally uses a small constant $\epsilon$ for numerical stability and applies a learned scale $\gamma$ and shift $\beta$ so that the layer can still represent the identity transform. Note that this scheme is differentiable and backpropagation can, therefore, be performed as usual. The corresponding derivatives can be found in the paper or derived using the chain rule and Algorithm 1. Note that for testing, the normalization scheme of each batch normalization layer still follows the above equations, but mean and standard deviation are estimated over the whole population instead of the current mini-batch.
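For reference, applying the chain rule to Algorithm 1 gives the gradients below. This is a sketch in the notation of Algorithm 1; the symbol $\ell$ for the training loss is introduced here for illustration only, all sums run over the mini-batch, and the result is best checked against the paper before relying on it.

$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \gamma$

$\frac{\partial \ell}{\partial \sigma^2} = \sum_{i = 1}^m \frac{\partial \ell}{\partial \hat{x}_i} (x_i - \mu) \left(-\frac{1}{2}\right) (\sigma^2 + \epsilon)^{-3/2}$

$\frac{\partial \ell}{\partial \mu} = \sum_{i = 1}^m \frac{\partial \ell}{\partial \hat{x}_i} \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma^2} \frac{1}{m} \sum_{i = 1}^m -2 (x_i - \mu)$

$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma^2} \frac{2 (x_i - \mu)}{m} + \frac{\partial \ell}{\partial \mu} \frac{1}{m}$

$\frac{\partial \ell}{\partial \gamma} = \sum_{i = 1}^m \frac{\partial \ell}{\partial y_i} \hat{x}_i, \quad \frac{\partial \ell}{\partial \beta} = \sum_{i = 1}^m \frac{\partial \ell}{\partial y_i}$

The second term of $\frac{\partial \ell}{\partial \mu}$ vanishes because $\sum_{i = 1}^m (x_i - \mu) = 0$; it is kept here only to make the chain rule explicit.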

function batch_normalization(
        $X = \{x_1,\ldots,x_m\}$, // mini-batch of layer output
        $\gamma, \beta$ // learned parameters
    )
    $\mu = \frac{1}{m} \sum_{i = 1}^m x_i$ // mini-batch mean
    $\sigma^2 = \frac{1}{m} \sum_{i = 1}^m (x_i - \mu)^2$ // mini-batch variance
    $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$ // normalize, for $i = 1,\ldots,m$
    $y_i = \gamma \hat{x}_i + \beta$ // scale and shift, for $i = 1,\ldots,m$
    return $\{y_1,\ldots,y_m\}$

Algorithm 1: The operations involved in a batch normalization layer.
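Algorithm 1 can be sketched in a few lines of plain Python for a single activation. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and tracking the population statistics with an exponential moving average (including the `momentum` value) is an assumption based on common practice rather than something stated above.

```python
import math


def batch_norm_train(xs, gamma, beta, eps=1e-5):
    """Training mode: normalize one activation over a mini-batch.

    xs is a list of floats (the values x_1..x_m of Algorithm 1).
    Returns the outputs y_1..y_m together with the batch mean and variance.
    """
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m  # biased estimate, as in Algorithm 1
    inv_std = 1.0 / math.sqrt(var + eps)
    ys = [gamma * (x - mu) * inv_std + beta for x in xs]
    return ys, mu, var


def update_running(running, batch_value, momentum=0.9):
    """Assumed bookkeeping: exponential moving average of a batch statistic."""
    return momentum * running + (1.0 - momentum) * batch_value


def batch_norm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Test mode: same formula, but with fixed population statistics."""
    return gamma * (x - pop_mean) / math.sqrt(pop_var + eps) + beta
```

During training, `batch_norm_train` would be called once per mini-batch and its returned mean and variance fed into `update_running`; at test time, `batch_norm_infer` then works on single samples because it no longer depends on the other samples in the batch.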

For convolutional neural networks, Ioffe and Szegedy introduce what could be called convolutional batch normalization. The normalization scheme remains the same, but to follow the overall framework of convolutional neural networks, normalization is performed across spatial locations as well as across the mini-batch samples. As a result, each feature map has a single mean and variance and a single pair of parameters $\gamma$ and $\beta$, instead of one per activation.
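The convolutional variant can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and the input is assumed to be a nested list indexed as `[sample][channel][row][column]`, which is a layout chosen here for readability and not prescribed by the paper.

```python
import math


def conv_batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel over all samples and all spatial locations.

    x is indexed as [sample][channel][row][column]; gamma and beta hold one
    learned scale and shift per channel.
    """
    n_channels = len(x[0])
    # Output with the same nested shape as the input.
    out = [[[[0.0 for _ in row] for row in fmap] for fmap in sample] for sample in x]
    for c in range(n_channels):
        # Pool the values of channel c from every sample and every location.
        values = [v for sample in x for row in sample[c] for v in row]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for n, sample in enumerate(x):
            for i, row in enumerate(sample[c]):
                for j, v in enumerate(row):
                    out[n][c][i][j] = gamma[c] * (v - mu) * inv_std + beta[c]
    return out
```

Compared to the fully connected case, the only change is the set of values entering the mean and variance: for a mini-batch of $m$ samples and feature maps of size $p \times q$, each channel is normalized over $m \cdot p \cdot q$ values.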

For practical use, Ioffe and Szegedy demonstrate the usefulness of batch normalization on MNIST and ImageNet. The specific results can be found in the paper; in addition, they provide some useful best practices for using batch normalization:

  • The learning rate usually can be increased. This has been shown on ImageNet.
  • Dropout layers can be dropped.
  • Weight decay, i.e. $L_2$ regularization, can be reduced.
  • The learning rate decay can be accelerated.
  • Local response normalization layers can be removed.

During experiments, they also visualize the input distributions for specific layers. I found this visualization to be quite interesting for monitoring training.
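One simple way to monitor a layer's input distribution during training is to log a few percentiles of its activations at regular intervals and plot them over time. The sketch below assumes the activations of one layer have already been collected into a flat list; the function names and the choice of the 15th, 50th and 85th percentiles are illustrative assumptions, not a description of the authors' tooling.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_activations(activations, qs=(15, 50, 85)):
    """Return a {percentile: value} summary for one layer at one training step."""
    return {q: percentile(activations, q) for q in qs}
```

Calling `summarize_activations` every few hundred iterations and plotting the three resulting curves shows whether the distribution drifts or stays stable as training progresses.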
