Noroozi and Favaro present a self-supervised learning task similar to the one proposed by Doersch et al. Specifically, they train convolutional neural networks to solve Jigsaw puzzles, which forces the networks to learn about spatial context and yields features useful for classification and detection. The overall idea is illustrated in Figure 1.
Figure 1: Illustration of the Jigsaw puzzle task: selected tiles in the original image (with random gaps) on the left, the randomly permuted tiles in the middle, and the solution on the right.
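To make the task in Figure 1 concrete, here is a minimal NumPy sketch of how a single training sample could be generated. It is an illustration under assumptions, not the authors' code: the function names `make_permutation_set` and `make_puzzle` are hypothetical, the crop, cell and tile sizes (225, 75 and 64 pixels) are illustrative defaults not stated in this summary, and the permutation set is simply drawn at random.

```python
import numpy as np

def make_permutation_set(num_perms=64, num_tiles=9, seed=0):
    """Draw a fixed set of distinct tile permutations (chosen at random here,
    which is an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    perms = set()
    while len(perms) < num_perms:
        perms.add(tuple(int(i) for i in rng.permutation(num_tiles)))
    return np.array(sorted(perms))  # shape: (num_perms, num_tiles)

def make_puzzle(image, perms, rng, grid=3, cell=75, tile=64):
    """Cut a grid of tiles with random gaps from the image and shuffle them.

    Returns the permuted tiles and the index of the applied permutation,
    which serves as the classification label.
    """
    size = grid * cell
    height, width = image.shape[:2]
    # Random crop that is large enough to hold the full grid.
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    crop = image[top:top + size, left:left + size]

    tiles = []
    for row in range(grid):
        for col in range(grid):
            # A tile smaller than its cell, placed at a random offset,
            # leaves a random gap between neighboring tiles.
            dy, dx = rng.integers(0, cell - tile + 1, size=2)
            y, x = row * cell + dy, col * cell + dx
            tiles.append(crop[y:y + tile, x:x + tile])
    tiles = np.stack(tiles)  # shape: (grid * grid, tile, tile, channels)

    label = int(rng.integers(len(perms)))
    return tiles[perms[label]], label
```

The random gaps are presumably there so that the network cannot solve the puzzle by simply matching edges across tile boundaries; the label is the index into the fixed permutation set, which turns puzzle solving into a classification problem.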
The presented architecture is shown in Figure 2 and consists of $9$ AlexNets with shared weights, one per tile. The representations computed for the individual tiles are fed into two fully connected layers and then into a softmax layer with $64$ outputs. Each output corresponds to one of the $64$ different permutations that may be applied to the input tiles.
Figure 2: Illustration of the architecture used. The tiles are extracted (including random gaps), then a permutation is applied, which the network is supposed to predict. The permuted tiles are fed into the $9$ AlexNets and the computed representations into two fully connected layers followed by a 64-way softmax.
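The data flow of Figure 2 can be sketched in a few lines of NumPy. This is only a toy forward pass under stated assumptions: a single linear layer with ReLU stands in for each AlexNet branch, all layer sizes are made up, the second fully connected layer is assumed to produce the 64 logits, and `init_params` and `forward` are hypothetical names. What it does show is the weight sharing: the same matrix `W_shared` is applied to every tile before the per-tile features are concatenated.

```python
import numpy as np

def init_params(rng, tile_dim=64 * 64 * 3, feat_dim=32, hidden=128,
                num_tiles=9, num_perms=64):
    """Random weights; one shared branch plus two fully connected layers."""
    return {
        "W_shared": rng.normal(0.0, 0.01, (tile_dim, feat_dim)),
        "W_fc1": rng.normal(0.0, 0.01, (num_tiles * feat_dim, hidden)),
        "W_fc2": rng.normal(0.0, 0.01, (hidden, num_perms)),
    }

def forward(tiles, params):
    """Map 9 permuted tiles to a probability for each candidate permutation."""
    flat = tiles.reshape(tiles.shape[0], -1)             # (9, tile_dim)
    # Shared weights: every tile passes through the same branch.
    feats = np.maximum(flat @ params["W_shared"], 0.0)   # (9, feat_dim)
    joint = feats.reshape(-1)                            # concatenate tile features
    hidden = np.maximum(joint @ params["W_fc1"], 0.0)    # first FC layer + ReLU
    logits = hidden @ params["W_fc2"]                    # second FC layer: 64 logits
    logits = logits - logits.max()                       # numerically stable softmax
    exp = np.exp(logits)
    return exp / exp.sum()
```

Training would minimize the cross-entropy between these probabilities and the index of the permutation that was actually applied.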
They demonstrate the usefulness of the learned representations on ImageNet and Pascal VOC 2007. They also present an intuitive visualization. To this end, they compute the $L_1$ norm of feature maps in specific layers and present the 16 patches (from different images) with the largest $L_1$ norm. This illustrates that specific feature maps in specific layers correspond to individual semantic concepts. The visualizations are shown in Figure 3.
Figure 3: Illustration of the learned layers/feature maps as described in the text. For each layer, 6 hand-chosen feature maps are shown - for each feature map the 16 most relevant image patches are shown.
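The ranking behind this visualization can be sketched as follows, assuming the activations of one layer have already been computed for one patch per image and stored as an array of shape `(num_patches, channels, height, width)`. The function name `top_patches_by_l1` and this array layout are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def top_patches_by_l1(feature_maps, channel, k=16):
    """Rank patches by the L1 norm of one feature map and return the top k.

    feature_maps: array of shape (num_patches, channels, height, width).
    Returns the indices of the k highest-scoring patches and their scores.
    """
    # L1 norm of the chosen feature map: sum of absolute activations.
    scores = np.abs(feature_maps[:, channel]).sum(axis=(1, 2))
    order = np.argsort(scores)[::-1][:k]  # highest scores first
    return order, scores[order]
```

The returned indices would then be used to look up and display the corresponding image patches for the selected feature map.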