APRIL 2017

X. Glorot, Y. Bengio. *Understanding the difficulty of training deep feedforward neural networks*. Aistats, 2010.

What is **your opinion** on this article? **Let me know** your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.

Glorot and Bengio propose a novel initialization scheme which today is often referred to as Xavier initialization (after Glorot's first name). However, they also discuss many other aspects of training deep networks equipped with sigmoid, hyperbolic tangent or softsign activation functions. In the following, I want to briefly summarize some of the visualizations and plots that support their investigation before detailing the theory behind Xavier initialization.

A major contribution of the paper unrelated to the proposed initialization scheme is the use of different visualization techniques in order to understand the inner workings of deep networks after initialization, during training and after training:

Figure 1: Mean and standard deviation of the activations of all four layers during training when using sigmoid activation functions.

Figure 2: 98 percentiles (markers) and standard deviation (markers with line) of the activations during training for the hyperbolic tangent (top) and softsign (bottom).

Figure 3: Normalized activation histograms after training for all layers, for the hyperbolic tangent (top) and the softsign (bottom).

Figure 4: Histograms of activations for each layer after initialization, for the heuristic initialization (top) and the proposed initialization (bottom).

Figure 5: Histograms of backpropagated gradients for each layer after initialization, for the heuristic initialization (top) and the proposed initialization (bottom).

Figure 6: Standard deviation of the backpropagated gradients for each layer during training, for the heuristic initialization (top) and the proposed initialization (bottom).

The proposed initialization is motivated by the poor performance of the heuristic initialization, where weights are initialized according to

$W_{ij} \sim U[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$

where $U[-a,a]$ is a uniform distribution and $n$ is the number of inputs (the fan-in) of the layer. For their derivation, they assume a general fully connected layer computing
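As a quick sanity check (a NumPy sketch, not the paper's code), the heuristic initialization can be written as follows; since $U[-a,a]$ has variance $a^2/3$, it gives $\text{Var}[W] = \frac{1}{3n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def heuristic_init(n_in, n_out):
    # Heuristic initialization: W_ij ~ U[-1/sqrt(n_in), 1/sqrt(n_in)],
    # where n_in is the fan-in of the layer.
    a = 1.0 / np.sqrt(n_in)
    return rng.uniform(-a, a, size=(n_out, n_in))

W = heuristic_init(256, 256)
# Var of U[-a, a] is a^2/3, so Var[W] = 1/(3 * n_in).
print(W.var())
```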

$z^{i + 1} = f(s^i) = f(W^i z^i + b^i)$

The activation function $f$ is assumed to be symmetric with $f'(0) = 1$ (i.e. approximately linear around zero). Assuming the network operates in the linear regime (i.e. $f'(s^i) = 1$) and that weights and inputs have zero mean and are independent yields

$\text{Var}[z^i] = \text{Var}[x] \prod_{i' = 0}^{i - 1} n_{i'} \text{Var}[W^{i'}]$

$\text{Var}[\frac{\partial L}{\partial s^i}] = \text{Var}[\frac{\partial L}{\partial s^d}]\prod_{i' = i}^d n_{i' + 1}\text{Var}[W^{i'}]$

where $x$ is the input, $n_i$ is the number of units in layer $i$ and $L$ is the training loss. The idea is to keep information flowing in both directions, i.e. to preserve the variance of the activations in the forward pass and of the gradients in the backward pass:

$\forall i,i':$ $\text{Var}[z^i] = \text{Var}[z^{i'}]$

$\forall i, i':$ $\text{Var}[\frac{\partial L}{\partial s^i}] = \text{Var}[\frac{\partial L}{\partial s^{i'}}]$

which can be equivalently stated as requiring

$\forall i:$ $n_i\text{Var}[W^i] = 1$,

$\forall i:$ $n_{i + 1}\text{Var}[W^i] = 1$.
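The forward condition can be checked empirically (a NumPy sketch in the linear regime, under the assumptions above, not the paper's code): the heuristic initialization gives $n\,\text{Var}[W] = \frac{1}{3}$, so activation variance shrinks by roughly a factor of 3 per layer, while rescaling the weights so that $n\,\text{Var}[W] = 1$ keeps the variance approximately constant:

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 512, 10
x = rng.standard_normal((2000, n))  # unit-variance input

def forward_variance(scale):
    # Propagate through `depth` linear layers (f = identity) with
    # W_ij ~ U[-scale, scale], i.e. Var[W] = scale**2 / 3.
    z = x
    for _ in range(depth):
        W = rng.uniform(-scale, scale, size=(n, n))
        z = z @ W.T
    return z.var()

# Heuristic: n * Var[W] = 1/3, so the variance shrinks like (1/3)**depth.
var_heuristic = forward_variance(1.0 / np.sqrt(n))
# Rescaled so that n * Var[W] = 1: the variance stays roughly constant.
var_preserving = forward_variance(np.sqrt(3.0 / n))
print(var_heuristic, var_preserving)
```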

So the goal is to have

$\forall i:$ $\text{Var}[W^i] = \frac{2}{n_i + n_{i + 1}}$

as a compromise, since both conditions cannot hold simultaneously unless $n_i = n_{i + 1}$. Sampling from a uniform distribution with this variance results in

$W \sim U[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i + 1}}}, \frac{\sqrt{6}}{\sqrt{n_i + n_{i + 1}}}]$,

the so-called Xavier initialization.
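In code, the resulting initializer can be sketched as follows (a minimal NumPy version; the variance of $U[-a,a]$ is $a^2/3$, which is why the bound contains $\sqrt{6}$):

```python
import numpy as np

rng = np.random.default_rng(2)

def xavier_init(n_in, n_out):
    # Uniform Xavier/Glorot initialization: sample from U[-a, a] with
    # a = sqrt(6 / (n_in + n_out)), giving Var[W] = 2 / (n_in + n_out).
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W = xavier_init(300, 500)
# Empirical variance should be close to 2 / (n_in + n_out) = 0.0025.
print(W.var())
```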