A. M. Saxe, J. L. McClelland, S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, 2013.

Saxe et al. give a mathematically concise treatment of deep linear networks in order to evaluate the advantage of pre-training for initialization. While I highly recommend the read for all machine learning practitioners interested in deep learning, the mathematics involved exceeds the intended scope of my reading notes — therefore, I only give the main conclusions as also used in the literature (e.g., in [1] to refine the proposed initialization scheme).

  • Pre-training using auto-encoders improves convergence and yields better performance if the input-output correlations resemble the input-input correlations. Informally (Saxe et al. make this precise mathematically), this can be interpreted as the auto-encoder loss yielding weights/kernels that are also useful for the actual supervised task.
  • Instead of random Gaussian initialization of the weight matrices, Saxe et al. recommend initializing them with random orthogonal matrices. This can also be extended to convolutional neural networks, as discussed, for example, in Hendrik Weideman's blog.
  • [1] D. Mishkin, J. Matas. All you need is a good init. CoRR, 2015.
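The orthogonal initialization above can be sketched in a few lines of NumPy; a common recipe (not spelled out in the paper) is to take the Q factor of the QR decomposition of a Gaussian matrix, with a sign correction so the result is uniformly distributed over orthogonal matrices. The function name and `gain` parameter are my own choices for illustration.

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=None):
    """Sample a random matrix with orthonormal columns (hypothetical
    helper; assumes rows >= cols for simplicity)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    # QR of a Gaussian matrix yields orthonormal columns in q;
    # multiplying by the signs of r's diagonal removes the sign
    # ambiguity so the distribution over orthogonal matrices is uniform.
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))
    return gain * q

# Example: a 4x4 orthogonal weight matrix, W^T W = I.
W = orthogonal_init(4, 4, seed=0)
```

For a fully-connected layer one would use this in place of a Gaussian draw; frameworks such as PyTorch ship a ready-made variant (`torch.nn.init.orthogonal_`).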
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.