Liang et al. tackle the problem of instance-level segmentation using their so-called Proposal-Free Network. This network addresses three subtasks: category-level segmentation, instance location prediction and instance number prediction. The important aspects of their network are shown in Figure 1. Note that two separate networks are described, one for category-level segmentation and the other for instance location and number prediction. Both networks are based on the pre-trained VGG16 network.
While a detailed description of the network architecture can be found in the paper, I want to mention some aspects. Given an input image, a category-level segmentation is obtained using a VGG16 network fine-tuned on PASCAL VOC 2012. On top of this fine-tuned network, an instance-level network is trained (the two networks nevertheless remain completely separate). The instance-level network preserves the architecture while adding multi-scale components: the feature maps after each convolutional layer are extracted, and each scale is trained with a separate loss to predict the instance location. The instance location is encoded by the center, top-left and bottom-right corners (Liang et al. note that this redundancy increases performance). The predictions from the multiple scales are down-sampled and trained to predict the final instance location. This allows the network to incorporate information across scales, a technique also used by regular category-level segmentation approaches. On top of the VGG16 network, the 1024 feature maps are combined with the instance location prediction in order to predict instance numbers.
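To make the location encoding concrete, here is a minimal sketch of how per-pixel regression targets could be built from a ground truth instance mask. The function name `instance_location_targets`, the use of raw pixel coordinates, taking the center as the bounding-box center, and reserving id 0 for background are all my assumptions, not details taken from the paper.

```python
import numpy as np

def instance_location_targets(instance_ids):
    """Build an (H, W, 6) target map from an (H, W) integer instance-id mask.

    Every pixel of an instance receives the same six numbers:
    (center_x, center_y, left, top, right, bottom) of the instance's
    bounding box. Background pixels (id 0, an assumption) stay zero.
    """
    h, w = instance_ids.shape
    targets = np.zeros((h, w, 6), dtype=np.float32)
    for inst in np.unique(instance_ids):
        if inst == 0:  # assumed background id
            continue
        ys, xs = np.nonzero(instance_ids == inst)
        x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
        # Assumption: the "center" is the bounding-box center.
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        targets[ys, xs] = [cx, cy, x0, y0, x1, y1]
    return targets
```

Since all pixels of one instance share the same target vector, pixels can later be grouped into instances by comparing their predicted vectors.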
As mentioned above, the category-level and instance-level networks are trained separately. The category-level network is trained using a per-pixel cross entropy loss. For instance location prediction, a smooth L1 loss is used for all foreground classes (background does not incur any loss). The losses for each individual scale are combined with a loss for the final prediction. As additional input, this network is fed the pixel coordinates as a two-channel image (i.e. the coordinates are fed to each scale). For the instance numbers, a Euclidean loss is used (i.e. the Euclidean loss between the predicted instance number vector and the ground truth for each category within the image). The background class is weighted higher for balancing.
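The location loss and the coordinate input can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the authors' implementation: the smooth L1 function follows its usual definition (quadratic below 1, linear above), while the helper names, the normalization by the number of foreground pixels and the (x, y) channel order are my own choices.

```python
import numpy as np

def smooth_l1(diff):
    """Element-wise smooth L1: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    a = np.abs(diff)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def foreground_location_loss(pred, target, foreground):
    """Smooth L1 loss over foreground pixels only.

    pred, target: (H, W, C) location maps; foreground: (H, W) boolean mask.
    Background pixels contribute nothing. Averaging over the foreground
    pixels is an assumption made for this sketch.
    """
    per_pixel = smooth_l1(pred - target).sum(axis=-1)
    n_fg = foreground.sum()
    if n_fg == 0:
        return 0.0
    return float((per_pixel * foreground).sum() / n_fg)

def coordinate_channels(h, w):
    """Two-channel (H, W, 2) image holding each pixel's (x, y) coordinates."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs, ys], axis=-1).astype(np.float32)
```

In a multi-scale setup, `foreground_location_loss` would be evaluated once per scale and once for the final prediction, and the results summed.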
During testing, the predicted instance locations are clustered to obtain the individual instances; see the paper for details. Some qualitative examples of the model are shown in Figure 2.
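To give an idea of what such a clustering step could look like, the following sketch groups the predicted location vectors of one category with plain k-means, using the predicted instance number as the number of clusters. Note that k-means is a stand-in chosen for brevity and is not necessarily the clustering used in the paper; the function name `cluster_locations` and the farthest-point initialization are likewise assumptions.

```python
import numpy as np

def cluster_locations(points, k, n_iter=20):
    """Plain k-means over (N, D) predicted location vectors.

    points: one row per pixel of a given category.
    k: assumed to be the predicted instance number for that category.
    Returns an (N,) array of cluster labels in [0, k).
    """
    points = np.asarray(points, dtype=np.float64)
    # Deterministic farthest-point initialization of the k centers.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.stack(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each non-empty cluster's center to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels
```

Pixels that end up with the same label would then form one predicted instance.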
Figure 2: Qualitative results showing the input image, the ground truth instance-level segmentation, the predicted category-level segmentation and the predicted instance-level segmentation.