D. Mishkin, J. Matas. All you need is a good init. CoRR, 2015.

Mishkin and Matas extend the orthonormal initialization scheme introduced in [] by additionally rescaling each layer's weights so that the variance of the layer's output, measured on the first batch, is approximately one. The full procedure is summarized in Algorithm 1.

function LSUV_initialization(
        $\tau$ // variance tolerance
    )
    pre-initialize weights with orthonormal matrices as in []
    for $l = 1,\ldots,L$ // for each layer
        // $x^{(l)}$ denotes the output tensor of layer $l$
        // $W^{(l)}$ denotes the weights of layer $l$
        forward pass to compute $x^{(l)}$
        calculate $\text{Var}[x^{(l)}]$
        while $|\text{Var}[x^{(l)}] - 1| \geq \tau$
            $W^{(l)} = \frac{W^{(l)}}{\sqrt{\text{Var}[x^{(l)}]}}$
            forward pass to compute $x^{(l)}$
            calculate $\text{Var}[x^{(l)}]$

Algorithm 1: LSUV - Layer-sequential unit-variance initialization.
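
To make Algorithm 1 concrete, the following is a minimal NumPy sketch for a stack of fully connected layers. It is not the authors' implementation. Several details are assumptions made for illustration: the function names `orthonormal` and `lsuv_init`, the use of bias-free layers with ReLU between them, measuring the variance over all entries of the pre-activation output, and the iteration cap `max_iter`, which does not appear in Algorithm 1 as given above.

```python
import numpy as np


def orthonormal(rows, cols, rng):
    """Return a (rows, cols) matrix with orthonormal rows or columns.

    Hypothetical helper: orthonormalizes a Gaussian matrix via QR.
    """
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)  # q has orthonormal columns, shape (max, min)
    return q if rows >= cols else q.T


def lsuv_init(layer_sizes, batch, tol=0.01, max_iter=10, seed=0):
    """Sketch of LSUV for bias-free fully connected layers.

    layer_sizes: e.g. [32, 64, 16] for two layers (32 -> 64 -> 16).
    batch:       array of shape (batch_size, layer_sizes[0]).
    tol:         variance tolerance (tau in Algorithm 1).
    max_iter:    assumed safety cap on rescaling steps per layer.
    """
    rng = np.random.default_rng(seed)
    # Pre-initialize all weights with orthonormal matrices.
    weights = [
        orthonormal(n_in, n_out, rng)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

    x = batch
    for l in range(len(weights)):
        for _ in range(max_iter):
            out = x @ weights[l]          # forward pass through layer l
            var = out.var()               # variance over the whole batch
            if abs(var - 1.0) < tol:
                break
            weights[l] = weights[l] / np.sqrt(var)
        # Assumed ReLU non-linearity; its output feeds the next layer.
        x = np.maximum(x @ weights[l], 0.0)
    return weights


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    first_batch = data_rng.standard_normal((256, 32))
    ws = lsuv_init([32, 64, 16], first_batch)
    print([w.shape for w in ws])
```

For a purely linear layer, a single rescaling already yields unit variance, since dividing $W^{(l)}$ by $\sqrt{\text{Var}[x^{(l)}]}$ divides the output variance by $\text{Var}[x^{(l)}]$. The loop matters when the measured output does not scale linearly with the weights, for example if the variance is taken after a saturating non-linearity.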

The corresponding implementation is easy to understand and can be found here.

  • [] A. M. Saxe, J. L. McClelland, S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, 2013.