Mishkin and Matas extend the orthonormal initialization scheme introduced in [] by additionally rescaling each layer's weights so that the variance of the layer's output, measured on the first batch, is approximately one. The full procedure is summarized in Algorithm 1.
function LSUV_initialization(
    $\tau$ // variance tolerance
)
    pre-initialize weights with orthonormal matrices as in []
    for $l = 1,\ldots,L$ // for each layer
        // $x^{(l)}$ denotes the output tensor of layer $l$
        while $|\text{Var}[x^{(l)}] - 1| \geq \tau$
            forward pass to compute $x^{(l)}$
            calculate $\text{Var}[x^{(l)}]$
            // $W^{(l)}$ denotes the weights of layer $l$
            $W^{(l)} = \frac{W^{(l)}}{\sqrt{\text{Var}[x^{(l)}]}}$
Algorithm 1: LSUV - Layer-sequential unit-variance initialization.
The corresponding implementation is easy to understand and can be found here.
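Independently of that implementation, the loop in Algorithm 1 can be illustrated with a minimal, dependency-free sketch. Everything in it beyond the algorithm itself is an assumption made for illustration: the network is a stack of square dense layers with ReLU activations, the orthonormal pre-initialization is done by Gram-Schmidt on Gaussian rows, and the names `orthonormal_matrix`, `layer_forward`, `lsuv_init`, and the iteration cap `max_iter` are hypothetical and not taken from the authors' code.

```python
import math
import random
import statistics


def orthonormal_matrix(n, rng):
    """Return an n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian rows)."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:  # remove the components along the rows found so far
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def layer_forward(W, batch):
    """Apply a dense layer followed by ReLU to every sample in the batch."""
    return [[max(0.0, sum(w * x for w, x in zip(row, sample))) for row in W]
            for sample in batch]


def lsuv_init(weights, batch, tau=0.05, max_iter=10):
    """Rescale each layer in place until its output variance is within tau of 1."""
    x = batch
    for l, W in enumerate(weights):
        for _ in range(max_iter):  # assumed safety cap on the while loop
            out = layer_forward(W, x)
            var = statistics.pvariance([v for sample in out for v in sample])
            if abs(var - 1.0) < tau or var == 0.0:
                break
            scale = math.sqrt(var)
            weights[l] = W = [[w / scale for w in row] for row in W]
        x = layer_forward(W, x)  # output of this layer feeds the next one
    return weights


# Usage: three 8-unit layers, initialized on one batch of 32 Gaussian samples.
rng = random.Random(0)
first_batch = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
net = [orthonormal_matrix(8, rng) for _ in range(3)]
lsuv_init(net, first_batch, tau=0.05)
```

Because ReLU is positively homogeneous, dividing the weights by the measured standard deviation brings the output variance to one in a single step for this toy network, so the loop exits on its second pass. With other nonlinearities several iterations may be needed, which is why the sketch caps the loop.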