Saxe et al. give a mathematically concise discussion of deep linear networks in order to evaluate the advantage of pre-training for initialization. While I highly recommend the paper to all machine learning practitioners interested in deep learning, the mathematics involved exceeds the intended scope of my reading notes. Therefore, I only give the main conclusions, as also used in the literature (e.g. to refine the proposed initialization scheme).
What is your opinion on the summarized work? Do you know of related work that is of interest? Let me know your thoughts in the comments below.