C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture for Computer Vision. CVPR, 2016.

Szegedy et al. discuss several improvements and practical considerations for the Inception module (and the corresponding GoogLeNet architecture) presented in []. One of the main motivations is that [] does not discuss in detail the individual factors that led to the success of GoogLeNet. Overall, they present a relatively loose collection of considerations and advice, which are finally combined in the Inception v2 and Inception v3 modules. I first summarize some of the underlying design choices before discussing Inception v2, Inception v3 and the proposed label smoothing in detail.

Design choices:

  • Representational bottlenecks should be avoided.
  • Spatial aggregation (e.g. a $3 \times 3$ convolutional layer) can be applied after dimensionality reduction using a $1 \times 1$ convolutional layer.
  • Width and depth of a network should be increased/decreased in parallel, i.e. there is a balance between depth and width.

Unfortunately, Szegedy et al. do not give concrete examples or experiments illustrating these points, and not all aspects of these design choices are discussed in detail as part of the proposed Inception modules.

Inception v2: The Inception v2 module relies on the observation that large filters in convolutional layers can be factorized into smaller filters, thereby reducing the computational cost while maintaining the expressiveness of the network. As an example, Szegedy et al. factorize a $5 \times 5$ filter into two successive $3 \times 3$ filters, which cover the same $5 \times 5$ receptive field. They also discuss whether the first of these $3 \times 3$ filters should use a linear activation or rectified linear units (ReLUs), and conclude that both filters should use ReLUs. The change in the Inception module is illustrated in Figure 1. They also discuss the factorization of general $n \times n$ filters into a $1 \times n$ and an $n \times 1$ filter, but conclude that this scheme does not perform well in early layers and should only be used on medium grid sizes (feature maps of roughly $12$ to $20$ pixels per side).

Figure 1: Illustration of the first step towards the Inception v2 module: replacing $5 \times 5$ filters by two consecutive $3 \times 3$ filters. Left: before; right: after.
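The cost argument behind both factorizations can be checked with a quick weight count. The following is only a sketch: the helper `conv_weights` and the channel count `c = 64` are assumptions made for illustration and are not taken from the paper; biases are ignored and the number of channels is kept fixed across layers.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights (ignoring biases) of a kh x kw convolution."""
    return kh * kw * c_in * c_out


c = 64  # hypothetical channel count, identical for input and output

# One 5x5 convolution versus two stacked 3x3 convolutions.
single_5x5 = conv_weights(5, 5, c, c)
two_3x3 = 2 * conv_weights(3, 3, c, c)

# One n x n convolution versus a 1 x n followed by an n x 1 convolution.
n = 7
single_nxn = conv_weights(n, n, c, c)
asymmetric = conv_weights(1, n, c, c) + conv_weights(n, 1, c, c)

print(two_3x3 / single_5x5)     # ratio 18/25
print(asymmetric / single_nxn)  # ratio 2n / n^2 = 2/7 for n = 7
```

Under these assumptions, the two $3 \times 3$ filters need $18/25$ of the weights of a single $5 \times 5$ filter, and the asymmetric factorization needs $2/n$ of the weights of an $n \times n$ filter. With stride 1 and equal feature map sizes, the number of multiplications scales in the same way.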

As another aspect of the Inception v2 module, Szegedy et al. discuss how to reduce the grid size efficiently while avoiding representational bottlenecks. Instead of pooling first (keeping the number of feature maps fixed) and then applying a convolutional layer (increasing the number of feature maps), they use pooling and convolution in parallel and concatenate the results in order to double the number of feature maps without introducing a bottleneck. This is illustrated in Figure 2.

Figure 2: Grid size reduction through parallel pooling and convolution.
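The shape bookkeeping of this parallel reduction can be sketched in a few lines. This is a simplified illustration with a single convolution branch, not the exact module from the paper; the function names, the $3 \times 3$ window, the stride of 2, the missing padding and the example sizes are all assumptions.

```python
def out_size(size, kernel, stride):
    """Spatial output size of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def parallel_reduction(size, channels, conv_filters):
    """Output shape when a stride-2 convolution and a stride-2 pooling
    run side by side and their feature maps are concatenated."""
    conv_grid = out_size(size, 3, 2)  # convolution branch, conv_filters maps
    pool_grid = out_size(size, 3, 2)  # pooling branch keeps `channels` maps
    if conv_grid != pool_grid:
        raise ValueError("branches must agree on the grid size")
    return conv_grid, conv_filters + channels


# Hypothetical input: a 35 x 35 grid with 320 feature maps.
grid, maps = parallel_reduction(35, 320, 320)
print(grid, maps)
```

With as many convolution filters as input feature maps, the grid is roughly halved while the number of feature maps doubles, and neither branch has to squeeze the representation through a narrower layer first.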

Overall, the Inception v2 module comes in different flavors. The full architecture is depicted in Figure 3, including the corresponding Inception v2 modules.

Figure 3: The architecture used in their paper (top) together with the different variants of the Inception v2 module (bottom). In the original paper, these variants are shown in Figure 5 (left), Figure 6 (middle) and Figure 7 (right).

Inception v3: Inception v3 does not really describe a different module compared to Inception v2, but rather the combination of the Inception v2 module with the following techniques: RMSProp for training, label smoothing as discussed below, factorized $7 \times 7$ filters using consecutive $1 \times 7$ and $7 \times 1$ filters, and batch-normalized auxiliary classifiers (see []).

Label smoothing: Label smoothing is intended to reduce the confidence of the trained network in its own predictions. In particular, training with cross-entropy loss forces the model to predict probability 1 for the ground truth label and 0 for the others. However, this results in overconfident models that may not generalize well. Szegedy et al. propose to regularize the cross-entropy loss using the following adapted version:

$H(q', p) = - \sum_{k = 1}^K q'(k) \log p(k) = (1 - \epsilon)H(q, p) + \epsilon H(u, p)$

where $q'(k) = (1 - \epsilon) q(k) + \epsilon u(k)$ is the smoothed target distribution, $q$ denotes the ground truth distribution (usually $1$ for one label and $0$ for all others), $p$ denotes the predictions of the network, $u$ is a prior distribution over the labels (e.g. uniform or label frequencies) and $\epsilon \in [0, 1]$ controls the amount of smoothing.
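The decomposition of the smoothed loss can be verified numerically. The snippet below is a minimal sketch in plain Python, not the authors' implementation; the function names, the three-class prediction `p` and the choice `eps = 0.1` with a uniform prior are assumptions for illustration.

```python
import math


def smoothed_targets(label, num_classes, eps=0.1, prior=None):
    """Mix the one-hot target for `label` with a prior (uniform by default)."""
    if prior is None:
        prior = [1.0 / num_classes] * num_classes
    return [
        (1.0 - eps) * (1.0 if k == label else 0.0) + eps * prior[k]
        for k in range(num_classes)
    ]


def cross_entropy(target, pred):
    """H(target, pred) = -sum_k target(k) * log pred(k); zero targets are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0.0)


eps = 0.1
p = [0.7, 0.2, 0.1]      # hypothetical network prediction
q = [1.0, 0.0, 0.0]      # one-hot ground truth for label 0
u = [1.0 / 3.0] * 3      # uniform prior over three labels

q_smooth = smoothed_targets(0, 3, eps)
lhs = cross_entropy(q_smooth, p)
rhs = (1.0 - eps) * cross_entropy(q, p) + eps * cross_entropy(u, p)
print(lhs, rhs)  # both sides agree up to floating point error
```

Because cross-entropy is linear in the target distribution, training on the smoothed targets is the same as adding the term $\epsilon H(u, p)$, which penalizes predictions that put almost all probability mass on a single label.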

  • [] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going deeper with convolutions. CVPR, 2015.