Doersch et al. present a context prediction task for convolutional neural networks in order to learn useful features for object detection in an unsupervised (or "self-supervised") fashion. The underlying idea is to let the convolutional neural network predict the relative position of two patches extracted from the same image; this is illustrated in Figure 1.
Figure 1: Illustration of the "self-supervised" task. Given the two patches on the left, the task is to decide where, relative to the first patch, the second patch was extracted. The idea is to learn useful features for object recognition.
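To make the task concrete, the following is a minimal sketch of how a training pair and its label could be sampled. It assumes the second patch is one of the eight neighbors of the first patch in a 3x3 grid, which is how I read the paper; the function name `sample_patch_pair`, the offset ordering, and the omission of any gap or jitter between patches are my own simplifications, not the authors' code.

```python
import numpy as np

# Offsets (row, col) of the eight neighbors in a 3x3 grid around the center patch.
# The index into this list serves as the classification label (assumed ordering).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(image, patch_size, rng):
    """Return (center_patch, neighbor_patch, label) from an H x W x C image."""
    height, width = image.shape[:2]
    if height < 3 * patch_size or width < 3 * patch_size:
        raise ValueError("image too small for a 3x3 grid of patches")

    # Top-left corner of the center patch, chosen so that every neighbor fits.
    top = int(rng.integers(patch_size, height - 2 * patch_size + 1))
    left = int(rng.integers(patch_size, width - 2 * patch_size + 1))

    # Pick one of the eight relative positions uniformly at random.
    label = int(rng.integers(len(NEIGHBOR_OFFSETS)))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    n_top = top + d_row * patch_size
    n_left = left + d_col * patch_size

    center = image[top:top + patch_size, left:left + patch_size]
    neighbor = image[n_top:n_top + patch_size, n_left:n_left + patch_size]
    return center, neighbor, label
```

The label is free: it comes from the sampling procedure itself, which is what makes the task "self-supervised".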
The task is learned using an AlexNet-style architecture for each patch; the two branches are fused in additional fully connected layers and share their weights. This is illustrated in Figure 2.
Figure 2: Illustration of the architecture. The dotted lines indicate shared weights.
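The weight sharing and late fusion can be sketched in a few lines of NumPy. This is a toy stand-in only: each AlexNet-style branch is replaced by a single linear layer with ReLU, and all names (`init_params`, `embed`, `predict_logits`) and layer sizes are hypothetical. The point is merely that both patches pass through the same parameters before their embeddings are concatenated and classified into one of the relative positions.

```python
import numpy as np


def init_params(patch_dim, embed_dim, hidden_dim, num_classes, rng):
    """Create ONE set of branch weights (shared) plus the fusion layers."""
    return {
        "branch_w": rng.normal(0.0, 0.01, (patch_dim, embed_dim)),
        "branch_b": np.zeros(embed_dim),
        "fuse_w": rng.normal(0.0, 0.01, (2 * embed_dim, hidden_dim)),
        "fuse_b": np.zeros(hidden_dim),
        "out_w": rng.normal(0.0, 0.01, (hidden_dim, num_classes)),
        "out_b": np.zeros(num_classes),
    }


def embed(patches, params):
    """Toy branch: flatten each patch, apply a linear layer and ReLU."""
    flat = patches.reshape(patches.shape[0], -1)
    return np.maximum(flat @ params["branch_w"] + params["branch_b"], 0.0)


def predict_logits(patches_a, patches_b, params):
    """Embed both patches with the same weights, then fuse and classify."""
    fused = np.concatenate(
        [embed(patches_a, params), embed(patches_b, params)], axis=1
    )
    hidden = np.maximum(fused @ params["fuse_w"] + params["fuse_b"], 0.0)
    return hidden @ params["out_w"] + params["out_b"]
```

Because `embed` reads the same `branch_w` and `branch_b` for both inputs, any gradient update to the branch affects both patches identically, which is what the dotted lines in Figure 2 express.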
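In order to successfully train the architecture, Doersch et al. use color dropping, i.e. randomly dropping two of the three color channels of each patch and replacing them with Gaussian noise. This prevents the network from exploiting chromatic aberration: because a lens focuses different wavelengths slightly differently, one color channel can appear shrunk towards the image center relative to the others, which reveals a patch's position within the image without requiring any understanding of its content.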
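As a rough illustration of color dropping, here is a small NumPy sketch. The function name `drop_colors` is hypothetical, the number of dropped channels is left as an explicit argument, and the noise scale `noise_std` is a placeholder of my own rather than the value used in the paper.

```python
import numpy as np


def drop_colors(patch, num_dropped, rng, noise_std=1.0):
    """Replace `num_dropped` randomly chosen channels of an H x W x C patch
    with Gaussian noise and return (new_patch, dropped_channel_indices)."""
    patch = np.asarray(patch, dtype=np.float64)
    num_channels = patch.shape[-1]
    if not 0 <= num_dropped < num_channels:
        raise ValueError("must keep at least one color channel")

    out = patch.copy()
    # Choose which channels to discard, without repetition.
    dropped = rng.choice(num_channels, size=num_dropped, replace=False)
    for channel in dropped:
        # The noise carries no image information, so misalignment between
        # color channels can no longer be measured.
        out[..., channel] = rng.normal(0.0, noise_std, size=patch.shape[:-1])
    return out, sorted(int(c) for c in dropped)
```

With fewer real color channels left in a patch, the network has less opportunity to compare how the channels are shifted relative to each other.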
Doersch et al. use the learned features for object detection on PASCAL VOC 2007 (see the paper for results) and to retrieve visually similar patches, as shown in Figure 3.
Figure 3: Illustration of nearest neighbors retrieved using the activations of the fully connected layers as features; see the paper for details.