I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. Chapter 11, MIT Press, 2016.

Chapter 11 is among the most interesting chapters for deep learning practitioners who already have some background in the underlying theory and algorithms. What Goodfellow et al. call "Practical Methodology" can best be described as a loose set of tips and tricks for approaching deep learning problems. They recommend the following general process:

  1. Define the problem to be solved, including the metrics used to assess whether it has been solved; it is also beneficial to define expected results in terms of the chosen metrics.
  2. Get an end-to-end prototype running that includes the selected metric.
  3. Incrementally do the following: diagnose a component (or aspect) that causes the system to underperform (e.g. hyperparameters, bugs, low-quality data, not enough data, model complexity, etc.) and fix it.
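
As a rough illustration of the three steps, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example and not taken from the book: the toy data, the choice of accuracy as the metric, the target value `TARGET_ACCURACY`, and the majority-class "model" that stands in for a first prototype.

```python
import random


def accuracy(y_true, y_pred):
    """The chosen metric: fraction of predictions that match the labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train):
    """Simplest possible prototype: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


def evaluate(model, xs, ys):
    """Run the model on every input and score it with the chosen metric."""
    return accuracy(ys, [model(x) for x in xs])


# Hypothetical toy data: the label is 1 when the single feature exceeds 0.3.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [int(x > 0.3) for x in xs]
x_train, y_train = xs[:150], ys[:150]
x_test, y_test = xs[150:], ys[150:]

# Step 1: fix the metric and the result you expect (value assumed for illustration).
TARGET_ACCURACY = 0.95

# Step 2: get something running end to end, including the metric.
model = majority_baseline(y_train)
train_acc = evaluate(model, x_train, y_train)
test_acc = evaluate(model, x_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f} target={TARGET_ACCURACY}")

# Step 3: the gap to the target tells you whether another diagnose-and-fix round is needed.
needs_work = test_acc < TARGET_ACCURACY
```

The point of starting with such a trivial baseline is that the whole pipeline (data split, model, metric, comparison to the target) exists from the first day, so each later change can be judged by the same number.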

Surprisingly, this approach has many parallels with modern, agile software engineering principles (e.g. prototyping, iterative development, risk focus).

Goodfellow et al. then discuss some of these aspects in detail. The most interesting points are made on diagnosing a running end-to-end system:

  • Visualize the results: do not focus only on the quantitative results in terms of the selected metrics; also visualize the results to assess them qualitatively. This includes looking at examples that are considered very difficult or very easy.
  • Always monitor training and test performance: as also discussed in Chapter 7, training and test performance may give important clues regarding hyperparameters or regularization such as early stopping. Comparing the two may also help to decide whether problems are caused by bugs or by underfitting/overfitting.
  • Try a tiny dataset: try a smaller or easier training set; if the model cannot fit even this, a bug is more likely than genuine underfitting.
  • Monitor activations and gradients: monitoring the activations may provide clues about the model complexity and activation functions. Together with monitoring the gradients, e.g. their magnitude, this might help to assess optimization performance, problems with the hyperparameters, etc.
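
Two of these checks, fitting a tiny dataset and monitoring gradient magnitudes, are easy to sketch in code. The example below is only an illustration under stated assumptions: it uses a hand-written logistic regression in plain Python instead of a deep network, and the four-point dataset, the learning rate, and the helper names (`train_logreg`, `predict`) are all made up for the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(xs, ys, lr=0.5, steps=500):
    """Full-batch gradient descent on the logistic loss.

    Returns the weights, the bias, and a per-step history of
    (loss, gradient norm) so that both can be monitored.
    """
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
            err = p - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
            grad_b += err / n
        grad_norm = math.sqrt(sum(g * g for g in grad_w) + grad_b * grad_b)
        history.append((loss, grad_norm))
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, history


def predict(w, b, x):
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


# Hypothetical tiny dataset: four points, the label equals the first feature.
tiny_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
tiny_y = [0, 0, 1, 1]

w, b, history = train_logreg(tiny_x, tiny_y)
tiny_acc = sum(predict(w, b, x) == y for x, y in zip(tiny_x, tiny_y)) / len(tiny_y)

# Print a few checkpoints of the monitored quantities.
for step in (0, 9, 99, 499):
    loss, grad_norm = history[step]
    print(f"step {step + 1:3d}: loss={loss:.4f} grad_norm={grad_norm:.4f}")
print(f"accuracy on the tiny dataset: {tiny_acc:.2f}")
```

If a model fails to reach perfect accuracy on a dataset this small and this easy, the training code itself is the first suspect. The logged gradient norm serves a similar purpose: a norm that collapses towards zero while the loss is still high, or one that keeps growing, points to an optimization or hyperparameter problem and not to a lack of data.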