# DAVIDSTUTZ

31 JANUARY 2018

Anurag Arnab, Philip H. S. Torr. Bottom-up Instance Segmentation using Deep Higher-Order CRFs. BMVC, 2016.

Arnab and Torr tackle semantic instance segmentation by combining the recent successes of convolutional neural networks for object detection and for semantic segmentation. The problem of instance-level segmentation is illustrated in Figure 1, showing detections, a semantic segmentation, and an instance segmentation. It becomes clear that semantic segmentation and object detection can benefit considerably from being solved jointly.

A high-level view of the proposed approach is shown in Figure 2. They utilize R-CNN to obtain object detections and train a CNN similar to [27,37] to obtain semantic segmentations. In addition to the CRF on top of the semantic segmentation CNN as in [37], they also consider potentials derived from the detections. Given a densely-connected CRF (i.e. comprising unary and pairwise potentials), new random variables $Y_1,\ldots,Y_D$ are introduced – one variable per detection. If a detection is found to be valid, the corresponding variable is set to 1. Letting $F_d$ be the set of foreground pixels of a detection (obtained using GrabCut), the detection potential is given by

$\psi_d^{Det}(X_d = x_d, Y_d = y_d) = \begin{cases}w_l \frac{s_d}{|F_d|}\sum_{i = 1}^{|F_d|} [x_d^{(i)}=l_d] & \text{if } y_d=0\\ w_l \frac{s_d}{|F_d|} \sum_{i = 1}^{|F_d|} [x_d^{(i)} \neq l_d] & \text{if } y_d = 1\end{cases}$

where $X_d$ denotes the pixel-wise labeling variables and $l_d$ is the label corresponding to the detection. The unary potential for each $Y_d$ is initialized to the confidence score of the detector; after inference, its value is taken as the recalibrated confidence score.
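The detection potential above can be sketched in NumPy as follows; `detection_potential` and its argument names are hypothetical, not the authors' implementation, and `w_l` stands in for the (learned, class-specific) weight from the paper:

```python
import numpy as np

def detection_potential(x_fg, l_d, s_d, y_d, w_l=1.0):
    """Sketch of the detection potential psi^Det (hypothetical names).

    x_fg : predicted labels at the |F_d| foreground pixels of
           detection d (foreground mask from GrabCut in the paper)
    l_d  : class label of the detection
    s_d  : detector confidence score
    y_d  : 0 if the detection is rejected, 1 if accepted
    w_l  : class-specific weight (learned in the paper)
    """
    x_fg = np.asarray(x_fg)
    n = len(x_fg)
    if y_d == 0:
        # rejected detection: pay a cost for every foreground pixel
        # that nevertheless takes the detection's label
        agree = np.sum(x_fg == l_d)
        return w_l * s_d / n * agree
    else:
        # accepted detection: pay a cost for every foreground pixel
        # that does not take the detection's label
        disagree = np.sum(x_fg != l_d)
        return w_l * s_d / n * disagree
```

Since this is an energy, the two cases pull in opposite directions: accepting a detection is cheap exactly when its foreground pixels already agree with its label.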

After predicting a semantic segmentation using the adapted CRF, another CRF is used for instance identification and refinement. Instead of labeling all pixels inside a bounding box as an instance (i.e. whenever the bounding box prediction and the label coincide), Arnab and Torr assign a probability to each pixel within the bounding box:

$Pr(v_i = k) = \begin{cases}\frac{1}{Z(Y,Q)} Q_i(l_k) Pr(Y_k = 1) & \text{if } i \in B_k\\0 & \text{otherwise}\end{cases}$

Here, $v_i$ denotes the identified instance at pixel $i$ and as such takes values in $\{1,\ldots,D\}$. Further, $Q_i(l)$ denotes the output of the initial semantic segmentation and $Z$ normalizes the probability. In order to allow pixels to be assigned to the background, an additional detection $d_0$ is added. The instance CRF takes the form:
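For a single pixel, this assignment can be sketched as below; `instance_distribution` and its argument names are hypothetical, assuming the label marginals $Q_i$ are given as an array indexed by class:

```python
import numpy as np

def instance_distribution(q_i, in_box, labels, p_det):
    """Sketch of Pr(v_i = k) for one pixel i (hypothetical names).

    q_i    : array, q_i[l] = Q_i(l), marginals of the semantic CRF
    in_box : boolean array, in_box[k] is True if pixel i lies in B_k
    labels : labels[k] = l_k, class label of detection k
    p_det  : p_det[k] = Pr(Y_k = 1), recalibrated detection score
    """
    # unnormalized score Q_i(l_k) * Pr(Y_k = 1), zero outside the box
    scores = np.where(in_box, q_i[labels] * p_det, 0.0)
    z = scores.sum()  # Z(Y, Q) in the paper's notation
    return scores / z if z > 0 else scores
```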

$E(v) = \sum_i \psi_i^U(v_i) + \sum_{i < j} \psi_{i,j}^P(v_i, v_j)$

$\psi_i^U(v_i) = - \ln Pr(v_i)$

where the pairwise potential encourages appearance and spatial consistency. Also note that this instance CRF differs from image to image, i.e. the number of detections (and therefore labels) changes with each image. Both CRFs are unrolled and integrated into the convolutional neural network.
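The unary term of this energy follows directly from the per-pixel instance probabilities; a minimal sketch, with `unary_energy` a hypothetical name and the pairwise term omitted:

```python
import numpy as np

def unary_energy(v, P):
    """Unary part of the instance CRF energy: sum_i -ln Pr(v_i).

    v : array of instance labels per pixel, values in 0..D
        (0 = background via the extra detection d_0)
    P : array of shape (num_pixels, D + 1) with P[i, k] = Pr(v_i = k)
    """
    eps = 1e-12  # guard against log(0) for zero-probability instances
    return -np.sum(np.log(P[np.arange(len(v)), v] + eps))
```

Lower energy thus corresponds to assignments that agree with the instance probabilities; the pairwise potential would add appearance and spatial consistency terms on top.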

In experiments, they show that the introduced detection potentials increase performance. Some qualitative results are shown in Figure 3.

• [27] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• [37] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me: