D. Mishkin, J. Matas. All you need is a good init. CoRR, 2015.

Mishkin and Matas extend the orthonormal initialization scheme introduced in [] by additionally rescaling each layer's weights so that the variance of the layer's output, measured on the first mini-batch, is approximately one. The full procedure is summarized in Algorithm 1.

function LSUV_initialization(
        $\tau$ // variance tolerance
    )
    pre-initialize weights with orthonormal matrices as in []
    for $l = 1,\ldots,L$ // for each layer
        // $x^{(l)}$ denotes the output tensor of layer $l$
        while $|\text{Var}[x^{(l)}] - 1| \geq \tau$ 
            forward pass to compute $x^{(l)}$
            calculate $\text{Var}[x^{(l)}]$
            // $W^{(l)}$ denotes the weights of layer $l$
            $W^{(l)} = \frac{W^{(l)}}{\sqrt{\text{Var}[x^{(l)}]}}$

Algorithm 1: LSUV - Layer-sequential unit-variance initialization.
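To make Algorithm 1 concrete, here is a minimal NumPy sketch for a small fully connected network. It is not the authors' implementation: the helper names `orthonormal` and `lsuv_init`, the ReLU non-linearity between layers, the iteration cap `max_iter`, and the choice to measure the variance on the linear output before the non-linearity are all assumptions made for illustration.

```python
import numpy as np


def orthonormal(rows, cols, rng):
    """Hypothetical helper: (rows, cols) matrix with orthonormal rows or columns."""
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)  # q has orthonormal columns, shape (max, min)
    return q if rows >= cols else q.T


def lsuv_init(layer_sizes, batch, tol=0.05, max_iter=10, seed=0):
    """Sketch of LSUV for a bias-free MLP; returns the list of weight matrices."""
    rng = np.random.default_rng(seed)
    # Pre-initialize every layer with an orthonormal matrix.
    weights = [
        orthonormal(n_in, n_out, rng)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]
    x = batch
    for l in range(len(weights)):
        W = weights[l]
        for _ in range(max_iter):  # assumed safeguard against endless loops
            out = x @ W            # forward pass through layer l
            var = out.var()        # variance over the whole output tensor
            if abs(var - 1.0) < tol:
                break
            W = W / np.sqrt(var)   # rescale weights towards unit output variance
        weights[l] = W
        x = np.maximum(x @ W, 0.0)  # assumed ReLU feeding the next layer
    return weights


rng = np.random.default_rng(1)
batch = 3.0 * rng.standard_normal((256, 64))  # deliberately badly scaled input
weights = lsuv_init([64, 128, 32], batch)
```

For a purely linear layer, a single rescaling step already yields unit output variance, because the output scales linearly with the weights. The loop matters when the measured output depends non-linearly on the weights.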

The corresponding implementation is easy to understand and can be found here.

  • [] A. M. Saxe, J. L. McClelland, S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, 2013.
