Jaderberg et al. introduce spatial transformer networks, built around a differentiable module that explicitly allows networks to learn spatial transformations. One motivation is to allow convolutional neural networks to explicitly learn invariance to specific spatial transformations. A spatial transformer network (or module) consists of two parts: a localization (or transformation estimation) network and a differentiable sampler.
The localization network is a regular (e.g., convolutional) network that estimates the parameters of a spatial transformation from the input feature map. The localization net is illustrated in blue in Figure 1.
The sampling part can be broken down into the grid generator (in green in Figure 1) and the sampler (in grey in Figure 1). The grid generator uses the estimated parameters to compute a grid correspondence, i.e., it essentially transforms a regular grid into the locations at which the input is to be sampled, as illustrated in Figure 2.
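To make the grid generator concrete, here is a minimal sketch for the affine case, assuming coordinates normalized to [-1, 1] and a 2x3 parameter matrix as produced by the localization network. The function name `affine_grid` and the nested-list representation are illustrative choices, not taken from the paper; plain Python is used for readability, whereas a practical implementation would rely on a tensor library.

```python
def affine_grid(theta, out_h, out_w):
    """Map every output (target) pixel to a sampling location in the input.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]] (assumed layout).
    Returns an out_h x out_w nested list of (x_s, y_s) tuples in
    normalized coordinates, where [-1, 1] spans the input extent.
    """
    (a, b, tx), (c, d, ty) = theta
    grid = []
    for i in range(out_h):
        # Normalized target y coordinate; a single row maps to the center.
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            # Normalized target x coordinate.
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            # Apply the affine map to the homogeneous point (x_t, y_t, 1).
            x_s = a * x_t + b * y_t + tx
            y_s = c * x_t + d * y_t + ty
            row.append((x_s, y_s))
        grid.append(row)
    return grid


# The identity transformation leaves the regular grid unchanged.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grid = affine_grid(identity, 3, 3)
```

With the identity parameters, the corners of the generated grid coincide with the corners of the input; scaling the diagonal entries below one would zoom in, and the last column shifts the sampling window.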
The differentiable sampler then takes the generated grid and samples the input feature map accordingly. For this step, standard interpolation schemes can be used. However, the step needs to be differentiable with respect to both the input feature map and the sampling coordinates, so that gradients can reach the localization network; in the case of bilinear interpolation, the equations are provided in the paper and can easily be implemented.
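As a rough illustration of the bilinear case, the sketch below samples a single-channel image at a continuous pixel location and also computes the (sub-)gradient of the sampled value with respect to the x coordinate. It follows the max(0, 1 - |.|) kernel form as I read it from the paper; the function names `bilinear_sample` and `bilinear_grad_x`, the use of pixel rather than normalized coordinates, and the zero contribution of out-of-bounds pixels are assumptions made for this example.

```python
import math


def bilinear_sample(image, x, y):
    """Sample image (list of rows) at continuous pixel coordinates (x, y)."""
    h, w = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Only the four neighbouring pixels have non-zero kernel weight.
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            if 0 <= n < h and 0 <= m < w:
                weight = max(0.0, 1.0 - abs(x - m)) * max(0.0, 1.0 - abs(y - n))
                value += image[n][m] * weight
    return value


def bilinear_grad_x(image, x, y):
    """(Sub-)gradient of the sampled value with respect to x."""
    h, w = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    grad = 0.0
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            # Pixels at distance >= 1 in x contribute zero gradient.
            if 0 <= n < h and 0 <= m < w and abs(m - x) < 1.0:
                sign = 1.0 if m >= x else -1.0
                grad += image[n][m] * max(0.0, 1.0 - abs(y - n)) * sign
    return grad


image = [[0.0, 1.0],
         [2.0, 3.0]]
center = bilinear_sample(image, 0.5, 0.5)   # average of the four pixels
slope = bilinear_grad_x(image, 0.5, 0.5)    # image rises by 1 per pixel in x
```

The gradient with respect to y is analogous with the roles of x and y swapped. In practice, an automatic differentiation framework would derive these gradients from the forward computation, so they rarely need to be written by hand.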