READING

X. Glorot, Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.

Glorot and Bengio propose a novel initialization scheme which today is often referred to as Xavier initialization (after Glorot's first name). However, they also discuss many other aspects of training deep networks equipped with sigmoid, hyperbolic tangent or softsign activation functions. In the following, I briefly summarize the visualizations and plots they use to support their investigation before detailing the theory behind Xavier initialization.

A major contribution of the paper, unrelated to the proposed initialization scheme, is the use of different visualization techniques to understand the inner workings of deep networks after initialization, during training and after training (a minimal code sketch for collecting such per-layer statistics follows the list):

  • Plotting mean and standard deviation of the activations (after applying the non-linearity) for each layer during training. This is illustrated in Figure 1 using sigmoid activations. It is easy to spot saturation in different layers.
  • Standard deviation and selected percentiles (Glorot and Bengio use the 98th percentile) of the activations (after the non-linearity) for each layer during training. This is illustrated for the hyperbolic tangent and softsign in Figure 2. It is easy to see when the layers start saturating in the case of the hyperbolic tangent.
  • Activation histograms, i.e. normalized histograms of activations (after the non-linearity) for each layer after training (averaged over multiple training examples if necessary). This is illustrated in Figure 3 for the hyperbolic tangent and the softsign. Here it is easy to see that most of the activations are in the saturated regions of the hyperbolic tangent.
  • Activation histograms (after the non-linearity) after initialization for each layer. This is illustrated in Figure 4 for heuristic random initialization and the proposed initialization scheme, see below. The backpropagated gradients can be visualized analogously, see Figure 5.
  • Standard deviation of the backpropagated gradients for each layer during training. This is illustrated in Figure 6 and shows that the proposed initialization scheme keeps essentially the same gradient variance across layers.
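
As a rough sketch of how such per-layer statistics can be collected, the following minimal NumPy example forwards a batch through a small tanh network and records mean, standard deviation, 98th percentile and a histogram of each layer's activations; the layer sizes, initialization and random input batch are illustrative assumptions, not the paper's setup, and during training one would simply repeat this every few updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected tanh network; sizes and initialization are illustrative only.
layer_sizes = [256, 128, 128, 128, 64]
weights = [rng.uniform(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in), size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def layer_statistics(x, weights):
    """Forward a batch and record statistics of each layer's activations."""
    stats = []
    z = x
    for W in weights:
        z = np.tanh(z @ W)  # activation after the non-linearity
        hist, _ = np.histogram(z, bins=20, range=(-1.0, 1.0))
        stats.append({
            "mean": float(z.mean()),
            "std": float(z.std()),
            "p98": float(np.percentile(z, 98)),  # 98th percentile, cf. Figure 2
            "histogram": hist,  # cf. the activation histograms in Figures 3 and 4
        })
    return stats

x = rng.standard_normal((512, layer_sizes[0]))  # a toy input batch
for i, s in enumerate(layer_statistics(x, weights), start=1):
    print(f"layer {i}: mean = {s['mean']:+.3f}, std = {s['std']:.3f}, 98th pct = {s['p98']:.3f}")
```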

Figure 1: Mean and standard deviation of the activations of all four layers during training when using sigmoid activation functions.

Figure 2: 98th percentile (markers) and standard deviation (markers with line) of the activations during training for the hyperbolic tangent (top) and softsign (bottom).

Figure 3: Normalized activation histograms after training for all layers for the hyperbolic tangent (top) and the softsign (bottom).

Figure 4: Histograms of activations for each layer after initialization for the heuristic initialization (top) and the proposed initialization (bottom).

Figure 5: Histograms of backpropagated gradients for each layer after initialization for the heuristic initialization (top) and the proposed initialization (bottom).

Figure 6: Standard deviation of the backpropagated gradients for each layer during training for the heuristic initialization (top) and the proposed initialization (bottom).

The proposed initialization is motivated by the poor performance of the heuristic initialization where weights are initialized according to

$W_{ij} \sim U[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$

where $U[-a, a]$ is the uniform distribution on $[-a, a]$ and $n$ is the number of inputs (fan-in) of the layer. For their derivation, Glorot and Bengio assume a general fully connected layer computing

$z^{i + 1} = f(s^i) = f(W^i z^i + b^i)$

The activation function $f$ is assumed to be symmetric with $f'(0) = 1$, i.e. approximately linear around zero. Assuming the network operates in this linear regime (i.e. $f'(s^i) \approx 1$) and that weights and inputs are independent with zero mean, the variance of a single layer satisfies $\text{Var}[z^{i + 1}] = n_i \text{Var}[W^i] \text{Var}[z^i]$. Iterating this relation, together with an analogous argument for the backpropagated gradients, yields

$\text{Var}[z^i] = \text{Var}[x] \prod_{i' = 0}^{i - 1} n_{i'} \text{Var}[W^{i'}]$

$\text{Var}[\frac{\partial L}{\partial s^i}] = \text{Var}[\frac{\partial L}{\partial s^d}]\prod_{i' = i}^d n_{i' + 1}\text{Var}[W^{i'}]$

where $x$ is the input, $n_i$ is the number of units in layer $i$, $d$ is the index of the last layer and $L$ is the training loss. The idea is to keep information flowing forward and backward through the network, i.e.

$\forall i,i':$ $\text{Var}[z^i] = \text{Var}[z^{i'}]$

$\forall i, i':$ $\text{Var}[\frac{\partial L}{\partial s^i}] = \text{Var}[\frac{\partial L}{\partial s^{i'}}]$

which, given the products above, holds exactly if each factor equals one, i.e. requiring

$\forall i:$ $n_i\text{Var}[W^i] = 1$,

$\forall i:$ $n_{i + 1}\text{Var}[W^i] = 1$.

So the goal is to have

$\forall i:$ $\text{Var}[W^i] = \frac{2}{n_i + n_{i + 1}}$

as a compromise. Since a uniform distribution $U[-a, a]$ has variance $\frac{a^2}{3}$, this corresponds to

$W^i \sim U[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i + 1}}}, \frac{\sqrt{6}}{\sqrt{n_i + n_{i + 1}}}]$,

the so-called Xavier initialization.
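
To make the effect of the two schemes concrete, the following minimal sketch (again NumPy; the layer sizes, tanh activation and random input are illustrative assumptions, not the paper's exact experimental setup) draws weights either from the heuristic initialization $U[-\frac{1}{\sqrt{n_i}}, \frac{1}{\sqrt{n_i}}]$ or from the Xavier initialization above and compares how the activation variance evolves across the layers of a deep tanh network.

```python
import numpy as np

rng = np.random.default_rng(0)

def heuristic_init(n_in, n_out):
    # Heuristic initialization: W ~ U[-1/sqrt(n_in), 1/sqrt(n_in)].
    a = 1.0 / np.sqrt(n_in)
    return rng.uniform(-a, a, size=(n_in, n_out))

def xavier_init(n_in, n_out):
    # Xavier initialization: W ~ U[-sqrt(6)/sqrt(n_in + n_out), sqrt(6)/sqrt(n_in + n_out)],
    # which gives Var[W] = 2 / (n_in + n_out).
    a = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-a, a, size=(n_in, n_out))

def activation_variances(init_fn, layer_sizes, x):
    """Forward a batch through a tanh network and return the per-layer activation variances."""
    variances = []
    z = x
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        z = np.tanh(z @ init_fn(n_in, n_out))
        variances.append(float(z.var()))
    return variances

layer_sizes = [256] * 8  # a toy architecture: seven tanh layers of width 256 (illustrative assumption)
x = rng.standard_normal((1024, layer_sizes[0]))

print("heuristic:", np.round(activation_variances(heuristic_init, layer_sizes, x), 4))
print("xavier:   ", np.round(activation_variances(xavier_init, layer_sizes, x), 4))
```

Under these assumptions, the activation variance shrinks from layer to layer with the heuristic initialization (since $n_i \text{Var}[W^i] = \frac{1}{3}$), while it stays roughly constant with Xavier initialization, which is qualitatively what Figure 4 shows.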

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.