READING

Tim Salimans, Diederik P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NIPS, 2016.

Salimans and Kingma introduce weight normalization, a reparameterization of the network weights intended to accelerate and stabilize training. In particular, consider a general non-linear neural network in which each neuron computes an output of the form

$y = \psi(w\cdot x + b)$

Then, the proposed reparameterization takes the form:

$w = \frac{g}{\|v\|}v$

i.e. the weight vector is decomposed into a direction $v/\|v\|$ and a scalar magnitude $g$. Instead of optimizing with respect to $w$ directly, Salimans and Kingma propose to optimize with respect to $g$ and $v$. The corresponding derivatives take the form

$\nabla_g L=\frac{\nabla_w L \cdot v}{\|v\|}$, $\nabla_v L = \frac{g}{\|v\|} \nabla_w L - \frac{g\nabla_g L}{\|v\|^2} v$
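
To make the reparameterization and the two gradient formulas concrete, here is a minimal NumPy sketch; it is not code from the paper, and the tanh non-linearity, the random dummy gradient $\nabla_w L$ and all variable names are my own choices for illustration. The finite-difference check at the end merely confirms the formulas for a linear toy loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy neuron y = psi(w . x + b) with the reparameterization w = (g / ||v||) * v.
x = rng.normal(size=5)                 # input
v = rng.normal(size=5)                 # direction parameter
g = 1.5                                # scalar magnitude parameter
b = 0.1                                # bias

norm_v = np.linalg.norm(v)
w = (g / norm_v) * v                   # reparameterized weight vector
y = np.tanh(w @ x + b)                 # psi = tanh, chosen only as an example

# Assume the loss gradient with respect to w is given; here it is a dummy value.
grad_w = rng.normal(size=5)

# Gradients from the paper's formulas.
grad_g = (grad_w @ v) / norm_v
grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v

# Sanity check with finite differences for the linear toy loss L(w) = grad_w . w,
# whose gradient with respect to w is exactly grad_w.
L = lambda g_, v_: grad_w @ ((g_ / np.linalg.norm(v_)) * v_)
eps = 1e-6
fd_g = (L(g + eps, v) - L(g - eps, v)) / (2 * eps)
fd_v = np.array([(L(g, v + eps * e) - L(g, v - eps * e)) / (2 * eps) for e in np.eye(5)])
print(np.isclose(grad_g, fd_g), np.allclose(grad_v, fd_v))  # True True
```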

An important property is that the gradient with respect to $v$ can be rewritten as:

$\nabla_v L = \frac{g}{\|v\|} M_w \nabla_w L$, with $M_w = I - \frac{ww^T}{\|w\|^2}$

Note that $M_w$ resembles a Householder transformation built from the normalized vector $w/\|w\|$, but without the factor of two it is simply the orthogonal projection away from $w$; in particular, the gradient with respect to $v$ is always orthogonal to $v$ itself.
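
As a quick numerical check (again a sketch of my own, not code from the paper), the projected form of the gradient agrees with the formula given earlier, and the result is indeed orthogonal to $v$:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)
g = 1.5
grad_w = rng.normal(size=5)

norm_v = np.linalg.norm(v)
w = (g / norm_v) * v

# Gradient of v via the original formula and via the projection M_w.
grad_g = (grad_w @ v) / norm_v
grad_v_direct = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v

M_w = np.eye(5) - np.outer(w, w) / (w @ w)                 # projects away from w
grad_v_projected = (g / norm_v) * (M_w @ grad_w)

print(np.allclose(grad_v_direct, grad_v_projected))        # True
print(np.isclose(grad_v_projected @ v, 0.0, atol=1e-12))   # True: orthogonal to v
```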

Based on the presented reparameterization and the projection property above, an important consequence is that the norm of $v$ grows monotonically during (stochastic) gradient descent: because the update of $v$ is orthogonal to $v$, every step can only increase $\|v\|$ (the mathematical details can be found in the paper). Noisy gradients therefore make the norm grow quickly. However, since $\|v\|$ appears in the denominator, a growing norm reduces the effective gradient scale $g/\|v\|$, so the scaled gradient self-stabilizes during training. This also makes training more robust with respect to the learning rate.
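
The following small simulation illustrates this effect; it is my own toy example, with a purely noisy artificial gradient and $g$ held fixed for simplicity. Plain SGD steps on $v$ never decrease its norm, so the effective gradient scale $g/\|v\|$ keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)
g = 1.5
lr = 0.1

norms = [np.linalg.norm(v)]
for _ in range(200):
    norm_v = np.linalg.norm(v)
    w = (g / norm_v) * v
    grad_w = rng.normal(size=5)                  # purely noisy "loss" gradient
    M_w = np.eye(5) - np.outer(w, w) / (w @ w)
    grad_v = (g / norm_v) * (M_w @ grad_w)       # orthogonal to v
    v = v - lr * grad_v                          # plain SGD step on v only
    norms.append(np.linalg.norm(v))

print(all(b >= a for a, b in zip(norms, norms[1:])))  # True: ||v|| never decreases
print(norms[0], norms[-1], g / norms[-1])             # effective scale g/||v|| shrinks
```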

Weight normalization is also closely related to batch normalization, which can be seen as the more general case: weight normalization corresponds to batch normalization applied to whitened inputs (when considering only one layer at a time). For convolutional layers, however, this correspondence does not hold exactly, because batch normalization additionally normalizes over spatial locations.

Salimans and Kingma additionally propose a data-dependent initialization scheme. The elements of $v$ are first initialized randomly with zero mean and standard deviation $0.05$. Given the first mini-batch, the mean $\mu[t]$ and standard deviation $\sigma[t]$ of the pre-activations $t$ (the activations before the non-linearity) are computed. Then, the biases are set as

$b = \frac{-\mu[t]}{\sigma[t]}$

and the scale $g$ is initialized as

$g = \frac{1}{\sigma[t]}$
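
Here is a sketch of how I read this initialization for a single fully-connected weight-normalized layer; the batch size, layer shape and all names are assumptions of mine. The pre-activations $t = v \cdot x / \|v\|$ are computed on the first mini-batch, and $g$ and $b$ are then set from their statistics so that the pre-activations are standardized on that batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first mini-batch and a weight-normalized fully-connected layer.
X = rng.normal(size=(64, 10))                    # 64 examples, 10 input features
V = rng.normal(scale=0.05, size=(10, 3))         # one v per output unit, std 0.05

# Pre-activations t = v . x / ||v|| for every example and output unit.
T = X @ (V / np.linalg.norm(V, axis=0))          # shape (64, 3)

mu = T.mean(axis=0)
sigma = T.std(axis=0)

g = 1.0 / sigma                                  # g = 1 / sigma[t]
b = -mu / sigma                                  # b = -mu[t] / sigma[t]

# On the first batch, the resulting pre-activations g * t + b are standardized.
pre_act = T * g + b
print(np.allclose(pre_act.mean(axis=0), 0.0, atol=1e-12), np.allclose(pre_act.std(axis=0), 1.0))
```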

Unfortunately, this scheme seems to be tailored to a single layer. It is unclear (and left unclear) how it extends to multiple layers, although the experiments explicitly involve multi-layer architectures. The special case of convolutional layers is not discussed either. Overall, it seems reasonable that the scheme is applied to every layer separately, such that in multi-layer networks multiple scaling factors $g$ are present and have to be considered during optimization; a possible layer-by-layer reading is sketched below.
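
The paper leaves the multi-layer case open; the following sketch merely spells out the layer-by-layer reading suggested above, initializing each layer from the activations produced by the already-initialized layers below it. The two-layer architecture, the tanh non-linearity and all names are assumptions of mine, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))                    # hypothetical first mini-batch

sizes = [10, 8, 4]                               # example two-layer architecture
params = []
H = X
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    V = rng.normal(scale=0.05, size=(n_in, n_out))
    T = H @ (V / np.linalg.norm(V, axis=0))      # pre-activations of this layer
    g = 1.0 / T.std(axis=0)
    b = -T.mean(axis=0) / T.std(axis=0)
    params.append((V, g, b))                     # each layer keeps its own g and b
    H = np.tanh(T * g + b)                       # propagate to initialize the next layer
```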

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.