C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going deeper with convolutions. CVPR, 2015.

Szegedy et al. propose GoogLeNet, a network built by stacking many Inception modules, for image recognition and object detection. Their main contribution is the so-called Inception module, motivated by the work of Arora et al. []. The Inception module is shown in Figure 1, where the $1 \times 1$ convolutional layers in front of the $3 \times 3$ and $5 \times 5$ convolutional layers are meant to reduce the dimensionality, i.e., the number of channels. There are two main motivations for the Inception module (as I see it):

  1. Practically, performance can usually be improved by designing deeper (and wider) networks. However, naively adding more and more convolutional or fully connected layers usually results in prohibitive computational requirements. The Inception module makes it possible to design deeper networks while making efficient use of computational resources.
  2. Theoretically, the growth in network size that comes with deeper networks could be addressed by introducing sparsity (in all layers and representations). The Inception module is meant as an approximation of the optimal sparse network structure that is nevertheless expressed using dense operations.
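
To make the first point more concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for a $5 \times 5$ convolution with and without a preceding $1 \times 1$ reduction. The helper `conv_macs` and the layer sizes (a $28 \times 28$ feature map, 192 input channels reduced to 16, 32 output channels) are my own illustrative assumptions, not numbers discussed above; stride 1 and "same" padding are assumed and biases are ignored.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates of a stride-1, same-padded convolution (biases ignored)."""
    return height * width * kernel * kernel * c_in * c_out


# Illustrative sizes (assumptions for this sketch).
H, W = 28, 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# 1x1 convolution reduces the channels first, then the 5x5 convolution follows.
reduced = conv_macs(H, W, 1, C_IN, C_REDUCED) + conv_macs(H, W, 5, C_REDUCED, C_OUT)

print(f"direct 5x5:       {direct:,} MACs")
print(f"1x1 then 5x5:     {reduced:,} MACs")
print(f"reduction factor: {direct / reduced:.1f}x")
```

With these sizes, the reduced variant needs roughly an order of magnitude fewer operations, which is the kind of saving that allows stacking many such modules.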

Figure 1: Illustration of the Inception module.
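
As a rough, shape-level sketch of the module in Figure 1: in the paper's version with dimensionality reduction, four parallel branches (a $1 \times 1$ convolution, a $1 \times 1$ followed by a $3 \times 3$ convolution, a $1 \times 1$ followed by a $5 \times 5$ convolution, and a $3 \times 3$ max pooling followed by a $1 \times 1$ projection) are concatenated along the channel dimension. The function and parameter names below are hypothetical, and the sketch only tracks shapes under the assumption of stride 1 and "same" padding; it does not compute any convolutions.

```python
def inception_output_shape(height, width, *, c_1x1, c_3x3_reduce, c_3x3,
                           c_5x5_reduce, c_5x5, c_pool_proj):
    """Output shape (H, W, C) of one Inception module.

    Assumes stride 1 and "same" padding in every branch, so all branches
    keep the spatial size and can be concatenated along the channels.
    """
    # The reduction widths only affect the cost inside a branch, not the
    # number of channels that the branch contributes to the output.
    for reduce_width in (c_3x3_reduce, c_5x5_reduce):
        if reduce_width <= 0:
            raise ValueError("reduction layers need at least one channel")

    branch_channels = {
        "1x1": c_1x1,                # plain 1x1 convolution
        "3x3": c_3x3,                # 1x1 reduction -> 3x3 convolution
        "5x5": c_5x5,                # 1x1 reduction -> 5x5 convolution
        "pool_proj": c_pool_proj,    # 3x3 max pooling -> 1x1 projection
    }
    return (height, width, sum(branch_channels.values()))


# Hypothetical channel configuration, chosen for illustration.
shape = inception_output_shape(28, 28, c_1x1=64, c_3x3_reduce=96, c_3x3=128,
                               c_5x5_reduce=16, c_5x5=32, c_pool_proj=32)
print(shape)
```

Note that the widths of the $1 \times 1$ reduction layers do not appear in the output shape: they only control how expensive the $3 \times 3$ and $5 \times 5$ branches are.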

The concrete instantiation of the Inception idea presented in the paper is called GoogLeNet. Trained on the ImageNet dataset, GoogLeNet improves on previously reported results on that benchmark.

  • [] S. Arora, A. Bhaskara, R. Ge, T. Ma. Provable Bounds for Learning Some Deep Representations. ICML, 2014.