Long Jin, Zeyu Chen, Zhuowen Tu. Object Detection Free Instance Segmentation With Labeling Transformations. CoRR, 2016.

Jin et al. propose a proposal- and detection-free pipeline for instance segmentation. As illustrated in Figure 1, their approach consists of three steps: an initial semantic segmentation, an instance label transformation (i.e. the choice of representation used for predicting instance labels), and the integration of the predicted instance labels with the semantic segmentation. Building the pipeline on a semantic segmentation contrasts with many other approaches, which rely on object detectors or proposal generators.

Figure 1: High-level view of the proposed pipeline.

Figure 2: Illustration of the different instance label representations.

For semantic labeling, they make use of the work in [5] without CRF post-processing. This semantic segmentation is later merged with an inferred instance labeling. Jin et al. propose three different representations of the instance labeling (all three illustrated in Figure 2):

  • pixel-based affinity mapping: instead of predicting instance numbers (which are subject to permutations), $5 \times 5$ pixel patches are extracted and, based on the instance labels, a $25 \times 25$ affinity matrix is constructed for each patch. The affinity matrices over all these patches are clustered into $100$ classes using $k$-means. The network learns to predict these classes such that a global affinity map can be reconstructed by projecting the $5 \times 5$ affinity patches into the image (the affinity patches are reconstructed from the predicted cluster). Given the affinity matrix, a spectral clustering algorithm is used to segment the instances.
  • superpixel-based affinity mapping: given a superpixel segmentation, additional convolutional layers predict the affinity of two selected superpixels. For this, the FCN features within both superpixels are concatenated and fed through the convolutional layers. The affinities are integrated into the semantic segmentation in the same way as for the pixel-based affinity transformation above.
  • boundary-based method: instead of encoding instances through affinity maps, the idea is to directly predict boundaries between instances. The predicted boundary pixels can then be labeled as background and a connected component algorithm can identify instances.
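
To make two of these representations more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `patch_affinity` and `instances_from_boundaries` are hypothetical, the affinity is assumed to be binary (1 if two pixels share an instance label, 0 otherwise), class id 0 is assumed to denote background, and 4-connectivity is assumed for the connected components. The first function builds the $25 \times 25$ affinity matrix of a $5 \times 5$ label patch, which does not change when instance numbers are permuted; the second removes predicted boundary pixels and labels the remaining connected components as instances.

```python
from collections import deque


def patch_affinity(patch):
    """Pairwise affinity matrix for a small patch of instance labels.

    Assumption: binary affinity, 1 if pixels i and j (row-major order)
    carry the same instance label, else 0. A 5x5 patch gives 25x25.
    """
    flat = [label for row in patch for label in row]
    return [[1 if a == b else 0 for b in flat] for a in flat]


def instances_from_boundaries(semantic, boundary):
    """Label instances as 4-connected components after removing boundaries.

    semantic: 2D list of class ids, 0 = background (assumed convention).
    boundary: 2D list of 0/1, 1 = predicted instance boundary pixel.
    Returns (labels, count); boundary and background pixels get label 0.
    """
    h, w = len(semantic), len(semantic[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if semantic[y][x] == 0 or boundary[y][x] or labels[y][x]:
                continue
            # Start a new instance and flood-fill it breadth-first.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == 0
                            and not boundary[ny][nx]
                            and semantic[ny][nx] == semantic[y][x]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Two instances side by side; relabeling them leaves the affinity unchanged.
patch_a = [[1, 1, 2, 2, 2] for _ in range(5)]
patch_b = [[7, 7, 3, 3, 3] for _ in range(5)]
affinity = patch_affinity(patch_a)
same_after_permutation = affinity == patch_affinity(patch_b)

# One semantic class split into two instances by a vertical boundary.
semantic = [[1, 1, 1, 1, 1] for _ in range(3)]
boundary = [[0, 0, 1, 0, 0] for _ in range(3)]
labels, count = instances_from_boundaries(semantic, boundary)
```

In the paper, the affinity matrices are additionally clustered with $k$-means and predicted by the network, and the final instances are obtained by spectral clustering; both steps are omitted in this sketch.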

Figure 3: Qualitative results for all three transformations. From left to right (for each row): input image, ground truth segmentation with instances, instance prediction using connected components, pixel-based affinity representation, superpixel-based affinity representation and boundary-based representation.

Overall, Jin et al. replace the overhead of object/proposal detectors used in other works with an additional overhead on the segmentation side. Qualitative results are shown in Figure 3.

  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.