Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, Raia Hadsell. Progressive Neural Networks. CoRR abs/1606.04671 (2016).

Rusu et al. propose progressive networks, sets of networks allowing transfer learning over multiple tasks without forgetting. The key idea of progressive networks is very simple. Instead of fine-tuning a model (for transfer learning), the pre-trained model is taken and its weights fixed. Another network is then trained from scratch while receiving features from the pre-trained network as additional input.

Specifically, the authors consider a sequence of tasks. For the first task, a deep neural network (e.g. multi-layer perceptron) is trained. Assuming $L$ layers with hidden activations $h_i^{(1)}$ for $i \leq L$, each layer computes

$h_i^{(1)} = f(W_i^{(1)} h_{i-1}^{(1)})$

where $f$ is an activation function and for $i = 1$, the network input is used. After training until convergence, a second network is trained – now on a different task. The parameters of the first network is fixed, but the second network can use the features of the first one:

$h_i^{(2)} = f(W_i^{(2)} h_{i-1}^{(2)} + U_i^{(2:1)}h_{i-1}^{(1)})$.

This idea can be generalized to the $k$-the network, which can use the activations from all the previous networks:

$h_i^{(k)} = f(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j < k} U_i^{(k:j)} h_{i-1}^{(j)})$.

For three networks, this is illustrated in Figure 1.

Figure 1: An illustration of the feature transfer between networks.

In practice, however, this approach results in an explosion of parameters and computation. Therefore, the authors apply a dimensionality reduction to the $h_{i – 1}^{(j)}$ for $j < k$. Additionally, an individual scaling factor is used to account for different ranges used in the different networks (also depending on the input data). Then, the above equation can be rewritten as

$h_i^{(k)} = f(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j < k} U_i^{(k)} f(V_i^{(k)} \alpha_i^{(:k)} h_{i-1}^{(:k)})$.

(Note that notation has been adapted slightly, as I found the original notation misleading.) Here, $h_{i – 1}^{(:k)}$ denotes the concatenated features from all networks $j < k$. Similarly, for each network, one $\alpha_i^{(j)}$ is learned to scale the features (note that the notation above would imply a element-wise multiplication of the $\alpha_i^{(j)}$'s repeated in a vector, or equivalently a matrix-vector product). $V_i^{(k)}$ then describes a dimensionality reduction; overall, a one-layer perceptron is used to “transfer” features from networks $j < k$ to the current network. The same approach can also be applied to convolutional layers (e.g. a $1 \times 1$ convolution can be used for dimensionality reduction).

In experiments, the authors show that progressive networks allow efficient transfer learning (efficient in terms of faster training). Additionally, they study which features are actually transferred.

Also find this summary on ShortScience.org.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.