Hoffer et al. propose an unsupervised (or, more precisely, self-supervised) training methodology for deep neural networks. Their work is in line with other work trying to learn representations in a self-supervised fashion. Given an image, the proposed approach, called spatial contrasting, takes two patches from the image (an anchor patch and a random additional, positive patch) together with a "contrasting", negative patch from another image. Then, the goal is to simultaneously maximize the conditional probability $p(f_{anchor}|f_{positive})$ and minimize $p(f_{anchor}|f_{negative})$, where $f$ denotes the features computed for the anchor patch, the positive patch and the negative patch. The loss is formed as
$L_{SC}(x_1,x_2) = -\log\frac{\exp(-\|f_1^{(1)} - f_1^{(2)}\|_2)}{\exp(-\|f_1^{(1)} - f_1^{(2)}\|_2) + \exp(-\|f_1^{(1)} - f_2^{(1)}\|_2)}$
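As this loss is not symmetric in $x_1$ and $x_2$ (the first image supplies the anchor patch and the positive patch, the second only the contrasting patch), the loss used for training averages over both orderings: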
$\hat{L}_{SC} (x_1, x_2) = \frac{1}{2}[L_{SC}(x_1, x_2) + L_{SC}(x_2, x_1)]$
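The two formulas above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' implementation: it assumes the patch features are already given as flat lists of floats, and the function names (`l2_distance`, `spatial_contrasting_loss`, `symmetric_spatial_contrasting_loss`) are hypothetical. It uses the identity $-\log\frac{e^{-d_+}}{e^{-d_+} + e^{-d_-}} = \log(1 + e^{d_+ - d_-})$ to evaluate the loss in a numerically stable way.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def spatial_contrasting_loss(f_anchor, f_positive, f_negative):
    """L_SC: two-way softmax over negative L2 distances, as a negative log-likelihood."""
    d_pos = l2_distance(f_anchor, f_positive)
    d_neg = l2_distance(f_anchor, f_negative)
    # -log(exp(-d_pos) / (exp(-d_pos) + exp(-d_neg))) == log(1 + exp(d_pos - d_neg))
    z = d_pos - d_neg
    # Stable softplus: avoids overflow in exp() for large positive z.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def symmetric_spatial_contrasting_loss(f1_a, f1_b, f2_a, f2_b):
    """Average of L_SC(x_1, x_2) and L_SC(x_2, x_1).

    f1_a, f1_b: features of two patches from image x_1, i.e. f_1^(1), f_1^(2).
    f2_a, f2_b: features of two patches from image x_2, i.e. f_2^(1), f_2^(2).
    """
    # x_1 provides anchor and positive, x_2 provides the contrasting patch.
    loss_12 = spatial_contrasting_loss(f1_a, f1_b, f2_a)
    # Roles swapped: x_2 provides anchor and positive, x_1 the contrasting patch.
    loss_21 = spatial_contrasting_loss(f2_a, f2_b, f1_a)
    return 0.5 * (loss_12 + loss_21)
```

When the positive and negative patches are equally far from the anchor, the loss equals $\log 2$; it approaches zero as the positive patch moves closer to the anchor than the negative one. In actual training, the features would come from the network and the loss would be minimized by backpropagation, which this sketch leaves out.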
This approach is also illustrated in Figure 1.
Figure 1: Illustration of the approach.