# DAVIDSTUTZ

22nd March 2018

Bernardino Romera-Paredes and Philip H. S. Torr. Recurrent Instance Segmentation. ECCV, 2016.

Romera-Paredes and Torr use convolutional recurrent neural networks, in particular convolutional LSTMs, for instance segmentation. The underlying idea of using a recurrent network is that humans usually identify instances by counting them sequentially. The corresponding high-level view of the proposed approach is illustrated in Figure 1 and consists of a fully convolutional network (e.g. FCN-8s [12]) and a convolutional LSTM. They also discuss a suitable loss that allows training the architecture end-to-end.

Long short-term memory (LSTM) networks are among the most successful recurrent networks as they mitigate the vanishing gradient problem. Building on this, Romera-Paredes and Torr present a natural extension of LSTM units to convolutional LSTM modules. The high-level view of the module is shown in Figure 2. The general structure as well as the gates remain the same; only the fully connected/linear layers are replaced by convolutional layers.
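To make the gate structure concrete, the following is a minimal numpy sketch of a single-channel ConvLSTM cell. It is not the authors' implementation; the kernel size, single-channel restriction, and initialization are assumptions for illustration. The key point is that every affine map in the standard LSTM gate equations is replaced by a same-padded convolution.

```python
import numpy as np

def conv_same(x, k):
    """Naive 'same'-padded 2D cross-correlation for single-channel maps."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Single-channel ConvLSTM cell (illustrative): the usual input,
    forget, output and candidate gates, with convolutions in place of
    the fully connected layers of a standard LSTM."""

    def __init__(self, ksize=3, rng=None):
        rng = rng or np.random.default_rng(0)
        # one input-to-state and one state-to-state kernel per gate
        self.Wx = {g: rng.normal(scale=0.1, size=(ksize, ksize)) for g in "ifog"}
        self.Wh = {g: rng.normal(scale=0.1, size=(ksize, ksize)) for g in "ifog"}
        self.b = {g: 0.0 for g in "ifog"}

    def step(self, x, h, c):
        pre = {g: conv_same(x, self.Wx[g]) + conv_same(h, self.Wh[g]) + self.b[g]
               for g in "ifog"}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = np.tanh(pre["g"])
        c_new = f * c + i * g        # convex update of the cell state
        h_new = o * np.tanh(c_new)   # gated hidden state, here a feature map
        return h_new, c_new
```

In the paper's architecture the hidden state is a multi-channel feature map from which the instance mask and confidence score are read out; the single-channel version above only illustrates the gating mechanism.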

Given the architecture in Figure 1, each application of the convolutional LSTM produces two outputs: a feature map which will correspond to the instance segmentation, and a confidence score (as also illustrated in Figure 1 on the right). These two outputs also correspond to the hidden state of the LSTM. Both outputs are post-processed as in Figure 3 in order to obtain a confidence score in $[0,1]$ and a confidence map in $[0,1]$ for the instance segmentation. Then, defining a suitable loss is non-trivial. In particular, the predicted instance segmentations need to be matched against the ground truth instances. For this, they employ the Hungarian algorithm based on Intersection-over-Union. With $\delta_{\hat{t},t}$ denoting whether predicted instance $\hat{t}$ is assigned to ground truth instance $t$, the loss is stated as

$l(\hat{Y}, s, Y) = \min_{\delta \in S} - \sum_{\hat{t} = 1}^{\hat{n}} \sum_{t = 1}^{n} f_{\text{IoU}}(\hat{Y}_{\hat{t}}, Y_t)\, \delta_{\hat{t},t} + \lambda \sum_{\hat{t} = 1}^{\hat{n}} f_{\text{BCE}}([\hat{t} \leq n], s_{\hat{t}})$
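A small numpy sketch of this loss may help. It is not the authors' code: the soft IoU definition is an assumption, and for simplicity the one-to-one matching is found by brute-force enumeration over assignments (feasible for small $n$) instead of the Hungarian algorithm the paper uses; both yield the same optimum. The second term applies binary cross-entropy to the confidence scores, with target $1$ for the first $n$ predictions and $0$ afterwards.

```python
import numpy as np
from itertools import permutations

def soft_iou(a, b):
    """Soft Intersection-over-Union for maps with values in [0, 1]."""
    inter = np.sum(a * b)
    union = np.sum(a) + np.sum(b) - inter
    return inter / union if union > 0 else 0.0

def bce(target, p, eps=1e-7):
    """Binary cross-entropy for a single probability."""
    p = np.clip(p, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

def instance_loss(pred_maps, scores, gt_maps, lam=1.0):
    """Matching term plus confidence term as in the loss above.
    Brute-force matching stands in for the Hungarian algorithm,
    assuming len(pred_maps) >= len(gt_maps)."""
    n_hat, n = len(pred_maps), len(gt_maps)
    # best one-to-one assignment of predictions to ground-truth instances
    best = max(
        sum(soft_iou(pred_maps[perm[t]], gt_maps[t]) for t in range(n))
        for perm in permutations(range(n_hat), n)
    )
    match_term = -best
    # confidence targets: 1 for the first n predictions, 0 for the rest
    conf_term = sum(bce(1.0 if t < n else 0.0, scores[t]) for t in range(n_hat))
    return match_term + lam * conf_term
```

As a sanity check, confident correct scores should yield a lower loss than uncommitted ones for the same predicted masks.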

For the experiments, the network is trained using Adam [40] in a curriculum learning fashion. This means that the network is first trained on two instances per image until convergence. Then the network is trained on all instances, where the number of applications of the LSTM is set to $n + 2$ with $n$ being the ground truth number of instances. This way, the network learns when to stop (i.e. when the confidence drops below $0.5$). Qualitative results are shown in Figure 4, where a CRF is additionally applied to obtain more fine-grained localization.
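The learned stopping criterion at inference time can be sketched as a simple loop: apply the recurrent module repeatedly and keep the predicted masks until the confidence score falls below the threshold. The `step_fn` interface below is a hypothetical stand-in for the trained FCN + ConvLSTM pipeline, not the authors' API.

```python
def segment_instances(step_fn, image, max_steps=10, threshold=0.5):
    """Inference loop for sequential instance segmentation.

    step_fn(image, state) -> (mask, score, state) is assumed to wrap the
    trained FCN + ConvLSTM; iteration stops as soon as the predicted
    confidence drops below the threshold (the learned stopping signal).
    """
    masks, state = [], None
    for _ in range(max_steps):
        mask, score, state = step_fn(image, state)
        if score < threshold:
            break
        masks.append(mask)
    return masks
```

With the training schedule above (running the LSTM for $n + 2$ steps), the network sees a few steps past the last true instance, which is what teaches the confidence score to fall below $0.5$ once all instances are segmented.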

• [12] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
• [40] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me.