Misra et al. propose temporal order verification as a way to learn features for action recognition in an unsupervised fashion. Temporal order verification means that, given a triple of images, a convolutional neural network has to decide whether the images are given in the correct temporal order. For this task, they use the AlexNet architecture to compute features for all three images (with shared weights). The features are concatenated and fed into another fully connected layer for classification, see Figure 1.
Figure 1: Illustration of the sampling process to obtain appropriate triples for learning (left) and the architecture used (right).
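The shared-weight design described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: a single linear layer with ReLU stands in for AlexNet, and the names `init_params`, `encode`, `order_logits` as well as all dimensions are hypothetical, not taken from the paper.

```python
import numpy as np


def init_params(in_dim, feat_dim, rng):
    # One encoder (shared by all three frames) and one classifier on top.
    return {
        "W_enc": rng.normal(0.0, 0.01, (in_dim, feat_dim)),
        "b_enc": np.zeros(feat_dim),
        "W_cls": rng.normal(0.0, 0.01, (3 * feat_dim, 2)),
        "b_cls": np.zeros(2),
    }


def encode(x, params):
    # Stand-in for the AlexNet feature extractor (assumption: linear + ReLU).
    return np.maximum(x @ params["W_enc"] + params["b_enc"], 0.0)


def order_logits(frames, params):
    # frames has shape (batch, 3, in_dim); the same encoder is applied to
    # each of the three frames, which is what "shared weights" means here.
    feats = [encode(frames[:, i], params) for i in range(3)]
    joint = np.concatenate(feats, axis=1)  # (batch, 3 * feat_dim)
    # Two logits per triple: incorrect order vs. correct order.
    return joint @ params["W_cls"] + params["b_cls"]


rng = np.random.default_rng(0)
params = init_params(in_dim=32, feat_dim=8, rng=rng)
frames = rng.normal(size=(4, 3, 32))
logits = order_logits(frames, params)
```

Because the encoder parameters are reused for every frame, the concatenated vector differs between orderings only in how the per-frame features are arranged, so the classifier has to learn the ordering from that arrangement.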
To sample appropriate triples, they focus on subsequences with large motion (measured by the magnitude of optical flow), because negative and positive triples would otherwise be hard to distinguish. Within such a subsequence, they sample five different frames as illustrated in Figure 1 and construct negative and positive samples from them.
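A minimal sketch of such a sampling step is given below. The summary above does not spell out how the five frames are combined, so the construction used here is an assumption: with ordered frames a < b < c < d < e, the triple (b, c, d) is treated as positive and (b, a, d) and (b, e, d) as negative. The function name `sample_triples`, the per-frame flow values, and the threshold `min_motion` are hypothetical.

```python
import random


def sample_triples(flow_magnitude, min_motion, rng):
    # Candidate frames: those whose flow magnitude reaches the threshold.
    candidates = [i for i, m in enumerate(flow_magnitude) if m >= min_motion]
    if len(candidates) < 5:
        return []  # not enough motion in this clip
    # Five distinct frame indices in temporal order.
    a, b, c, d, e = sorted(rng.sample(candidates, 5))
    return [
        ((b, c, d), 1),  # positive: middle frame lies between b and d
        ((b, a, d), 0),  # negative: middle frame comes before b
        ((b, e, d), 0),  # negative: middle frame comes after d
    ]


flow = [0.1, 2.0, 3.0, 0.2, 2.5, 4.0, 0.1, 3.5, 2.2]
triples = sample_triples(flow, min_motion=1.0, rng=random.Random(0))
```

Keeping the outer frames b and d fixed across positive and negative triples means the label depends only on where the middle frame falls in time.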
In experiments, they show that the learned features are beneficial for action recognition. As a visualization, they retrieve nearest neighbors using the features of the fully connected layer fc7 (see Figure 1). The results are shown in Figure 2 and compared to AlexNet pre-trained on ImageNet only.
Figure 2: Nearest neighbor visualizations. For the query image on the left, the nearest neighbor is usually from the same video (first row). After discarding these trivial nearest neighbors, the nearest neighbors shown in the second row mostly correspond to similar actions for the proposed model.
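The retrieval behind this visualization, including the step of discarding neighbors from the same video, can be sketched as follows. This is an illustration under assumptions: cosine similarity is used as the metric (the summary does not state which one the authors use), the feature vectors are tiny toy values rather than real fc7 activations, and `nearest_neighbor` and `video_ids` are hypothetical names.

```python
import numpy as np


def nearest_neighbor(query_idx, features, video_ids, skip_same_video=True):
    # Normalize rows so that the dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # never return the query itself
    if skip_same_video:
        # Discard the trivial neighbors that come from the query's video.
        sims[video_ids == video_ids[query_idx]] = -np.inf
    return int(np.argmax(sims))


# Toy features: rows 0 and 1 belong to the same video.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])
video_ids = np.array([0, 0, 1, 2])

same_video_allowed = nearest_neighbor(0, features, video_ids, skip_same_video=False)
same_video_skipped = nearest_neighbor(0, features, video_ids, skip_same_video=True)
```

In this toy example, the closest frame to the query comes from the same video; once those frames are masked out, the neighbor is taken from a different video, mirroring the two rows of Figure 2.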