Bernardino Romera-Paredes, Philip Hilaire Sean Torr. Recurrent Instance Segmentation. ECCV, 2016.

Romera-Paredes and Torr use convolutional recurrent neural networks, in particular convolutional LSTMs, for instance segmentation. The motivation for using a recurrent network is that humans usually identify instances by counting them sequentially. The proposed approach, illustrated at a high level in Figure 1, consists of a fully convolutional network (e.g. FCN-8s [12]) followed by a convolutional LSTM. They also discuss a loss that allows the architecture to be trained end-to-end.

Figure 1: High-level view of the proposed architecture comprising a fully convolutional network (FCN) and a convolutional LSTM that predicts one instance per iteration.

Long short-term memory networks (LSTMs) are among the most successful recurrent networks because they mitigate the vanishing gradient problem. Romera-Paredes and Torr therefore present a natural extension of LSTM units to convolutional LSTM modules. The high-level view of the module is shown in Figure 2. The general structure and the gates remain the same; only the fully connected (linear) layers are replaced by convolutional layers.

Figure 2: Diagram of a convolutional LSTM unit/layer.
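
To make the replacement of linear layers by convolutions concrete, the following is a minimal sketch of a single convolutional LSTM step in plain Python. It assumes single-channel feature maps, zero-padded convolutions, and a standard LSTM gate layout (input, forget, output, candidate) without peephole connections; the function names and the `params` layout are hypothetical and are not taken from the paper.

```python
import math

def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < h and 0 <= c < w:
                        acc += x[r][c] * k[a][b]
            out[i][j] = acc
    return out

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def conv_lstm_step(x, h_prev, c_prev, params):
    """One ConvLSTM step on single-channel maps (illustrative assumption).

    params maps each gate name ('i', 'f', 'o', 'g') to a tuple
    (kernel_for_input, kernel_for_hidden, bias).
    """
    rows, cols = len(x), len(x[0])

    def pre_activation(gate):
        # Where a plain LSTM uses W x + U h + b, convolutions are used instead.
        kx, kh, b = params[gate]
        ax, ah = conv2d_same(x, kx), conv2d_same(h_prev, kh)
        return [[ax[r][c] + ah[r][c] + b for c in range(cols)] for r in range(rows)]

    pi, pf, po, pg = (pre_activation(g) for g in ("i", "f", "o", "g"))
    # Cell state: forget part of the old state, add the gated candidate.
    c_new = [[sigmoid(pf[r][c]) * c_prev[r][c]
              + sigmoid(pi[r][c]) * math.tanh(pg[r][c])
              for c in range(cols)] for r in range(rows)]
    # Hidden state: gated, squashed cell state.
    h_new = [[sigmoid(po[r][c]) * math.tanh(c_new[r][c])
              for c in range(cols)] for r in range(rows)]
    return h_new, c_new
```

Because the gates are computed by convolutions, the hidden and cell states keep the spatial layout of the input feature map, which is what allows the hidden state to be turned into a segmentation map.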

Figure 3: Post-processing of the hidden state (i.e. the output of the LSTM) in order to obtain an instance confidence score and the corresponding segmentation confidence map.

Given the architecture in Figure 1, each application of the convolutional LSTM produces two outputs: a feature map that corresponds to the instance segmentation, and a confidence score (as also illustrated on the right of Figure 1). Both are derived from the hidden state of the LSTM, which is post-processed as in Figure 3 in order to obtain a confidence score in $[0,1]$ and a confidence map with values in $[0,1]$ for the instance segmentation. Defining a suitable loss is non-trivial because the predicted instance segmentations need to be matched against the ground truth instances. For this, they employ the Hungarian algorithm based on Intersection-over-Union. With $\delta_{\hat{t},t}$ denoting whether predicted instance $\hat{t}$ is assigned to ground truth instance $t$, the loss is stated as

$l(\hat{Y},s,Y) = \min_{\delta \in S} - \sum_{\hat{t} = 1}^{\hat{n}} \sum_{t = 1}^{n} f_{\text{IoU}} (\hat{Y}_{\hat{t}}, Y_t) \delta_{\hat{t},t} + \lambda \sum_{t = 1}^{\hat{n}} f_{\text{BCE}} ([t \leq n], s_t)$
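
As a rough illustration of how this loss could be evaluated, here is a small sketch in plain Python. Several details are assumptions that the text above does not spell out: $f_{\text{IoU}}$ is taken to be a soft IoU on flattened confidence maps, $f_{\text{BCE}}$ the binary cross-entropy, $[t \leq n]$ an indicator that is 1 for the first $n$ predictions, and $S$ the set of one-to-one assignments with $\hat{n} \geq n$. For brevity, the minimum over $\delta$ is found by brute force over all assignments rather than with the Hungarian algorithm, which is only practical for a handful of instances. All function names are hypothetical.

```python
import math
from itertools import permutations

def soft_iou(pred, gt):
    """Soft IoU between two flattened maps with values in [0, 1] (assumed form)."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(pred) + sum(gt) - inter
    return inter / union if union > 0 else 0.0

def bce(target, score, eps=1e-7):
    """Binary cross-entropy; the score is clipped to avoid log(0)."""
    score = min(max(score, eps), 1.0 - eps)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))

def matching_loss(preds, scores, gts, lam=1.0):
    """Evaluate the loss above for one image.

    preds:  list of n_hat flattened predicted confidence maps
    scores: list of n_hat confidence scores
    gts:    list of n flattened binary ground truth masks (n <= n_hat assumed)
    Returns (loss, assignment), where assignment[t] is the index of the
    prediction matched to ground truth instance t.
    """
    n_hat, n = len(preds), len(gts)
    if n_hat < n:
        raise ValueError("this sketch assumes at least as many predictions as instances")
    iou = [[soft_iou(p, g) for g in gts] for p in preds]
    best_iou, best_assign = -1.0, ()
    # Brute-force stand-in for the Hungarian algorithm: try every way of
    # assigning a distinct prediction to each ground truth instance.
    for assign in permutations(range(n_hat), n):
        total = sum(iou[assign[t]][t] for t in range(n))
        if total > best_iou:
            best_iou, best_assign = total, assign
    # Score term: the first n scores should be 1, the remaining ones 0.
    score_term = sum(bce(1.0 if t < n else 0.0, scores[t]) for t in range(n_hat))
    return -best_iou + lam * score_term, best_assign
```

In a real training setup the matching would be computed with a polynomial-time assignment solver and the loss would be differentiated through the soft IoU and the scores; this sketch only evaluates the value.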

For the experiments, the network is trained using Adam [40] in a curriculum learning fashion. First, the network is trained on two instances per image until convergence. Then it is trained on all instances, with the number of applications of the LSTM set to $n + 2$, where $n$ is the ground truth number of instances. This way, the network learns when to stop (i.e. once the confidence becomes smaller than $0.5$). Qualitative results are shown in Figure 4, where a CRF is additionally applied in order to get more fine-grained localization.
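
The stopping rule at test time can be sketched as a simple loop. The `step` callable below is a hypothetical stand-in for one application of the convolutional LSTM plus post-processing; its signature, the `max_steps` safeguard, and the function name are assumptions made for illustration only.

```python
def segment_instances(step, features, max_steps=20, threshold=0.5):
    """Apply the recurrent step until the confidence drops below the threshold.

    step(features, state) is assumed to return (mask, confidence, new_state),
    with state=None on the first call.
    """
    masks, state = [], None
    for _ in range(max_steps):
        mask, confidence, state = step(features, state)
        if confidence < threshold:
            # The network signals that no further instances are present.
            break
        masks.append(mask)
    return masks
```

Training with $n + 2$ applications exposes the network to iterations in which no instance remains, which is what gives the low-confidence outputs used by this loop a training signal.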

Figure 4: Qualitative results showing the input image (left), the result of the proposed network (middle left), the results refined by a CRF (middle right), and the ground truth (right).

  • [12] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [40] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.