Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu. Spatial Transformer Networks. NIPS, 2015.

Jaderberg et al. introduce spatial transformer networks, a differentiable module that allows networks to explicitly learn spatial transformations. One motivation is to allow convolutional neural networks to learn invariance to specific spatial transformations. A spatial transformer network (or module) consists of two parts: a localization (or transformation estimation) network and a differentiable sampler.

The localization network is a regular (e.g., convolutional) network that estimates the parameters of a spatial transformation from the input feature map. The localization network is illustrated in blue in Figure 1.

Figure 1: Illustration of the basic structure of a spatial transformer module.

The sampling part can be broken down into the grid generator (green in Figure 1) and the sampler (grey in Figure 1). The grid generator uses the estimated parameters to compute a grid correspondence, i.e., it transforms a regular grid over the output into the corresponding sampling locations in the input, as illustrated in Figure 2.
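To make the grid generator more concrete, here is a minimal NumPy sketch for the affine case. The function name `affine_grid`, the 2x3 parameter matrix `theta`, and the normalization of coordinates to [-1, 1] are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def affine_grid(theta, out_height, out_width):
    """Map a regular output grid to sampling coordinates in the input.

    theta: (2, 3) affine matrix (hypothetical parameterization).
    Coordinates are assumed to be normalized to [-1, 1].
    Returns an array of shape (out_height, out_width, 2) holding (x, y).
    """
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, out_height),
        np.linspace(-1.0, 1.0, out_width),
        indexing="ij",
    )
    # Homogeneous target coordinates (x_t, y_t, 1), one column per output pixel.
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=0)
    # Source coordinates (x_s, y_s) for every output pixel, shape (2, H * W).
    source = theta @ target
    return source.T.reshape(out_height, out_width, 2)


# The identity transformation reproduces the regular grid.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 3, 3)

# Scaling by 0.5 samples only the central region of the input (a zoom-in).
zoom = np.array([[0.5, 0.0, 0.0],
                 [0.0, 0.5, 0.0]])
zoom_grid = affine_grid(zoom, 3, 3)
```

In a full module, `theta` would be the output of the localization network rather than a fixed matrix.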

Figure 2: Illustration of the grid generator.

The differentiable sampler then takes the generated grid and samples the input feature map accordingly. For this step, standard interpolation schemes can be used. However, the step needs to be differentiable; for bilinear interpolation, the equations are provided in the paper and are easy to implement.
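As a rough illustration of the sampling step, the following NumPy sketch performs bilinear interpolation of a single-channel feature map at given sampling locations. The function name `bilinear_sample`, the [-1, 1] coordinate normalization, and the clamping of out-of-range locations to the border are assumptions for this example, not details taken from the paper. Only the forward pass is shown; since the output is piecewise linear in both the feature map and the sampling coordinates, (sub-)gradients with respect to both can be derived.

```python
import numpy as np


def bilinear_sample(feature_map, grid):
    """Sample a (H, W) feature map at the (x, y) locations in grid.

    grid: (H_out, W_out, 2) with coordinates normalized to [-1, 1].
    Locations outside the input are clamped to the border (an assumption).
    """
    height, width = feature_map.shape
    # Convert normalized coordinates to pixel coordinates.
    x = (grid[..., 0] + 1.0) * (width - 1) / 2.0
    y = (grid[..., 1] + 1.0) * (height - 1) / 2.0
    x = np.clip(x, 0.0, width - 1)
    y = np.clip(y, 0.0, height - 1)

    # Indices of the four neighboring pixels.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)

    # Interpolation weights are the fractional offsets from the top-left pixel.
    wx = x - x0
    wy = y - y0

    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom


# Sampling on the regular (identity) grid reproduces the input.
image = np.arange(16, dtype=float).reshape(4, 4)
ys, xs = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 4), indexing="ij")
identity_grid = np.stack([xs, ys], axis=-1)
resampled = bilinear_sample(image, identity_grid)

# Sampling the center of a 2x2 map averages its four pixels.
center = bilinear_sample(np.array([[0.0, 1.0], [2.0, 3.0]]), np.zeros((1, 1, 2)))
```

For multi-channel feature maps, the same sampling locations would be applied to every channel independently.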

Implementations of spatial transformer networks for Torch are available on GitHub: ankurhanda/gvn and qassemoquab/stnbhwd.
