Based on the popular R-CNN object detector, Gupta et al. tackle a set of problems ranging from object detection in RGB-D images to instance segmentation. Overall, the presented pipeline combines many state-of-the-art approaches to individual problems such as proposal generation, instance segmentation and semantic segmentation. Here, I want to focus on their approach to instance segmentation.
To obtain region proposals, they extend Multiscale Combinatorial Grouping (MCG) to make use of the depth channel, using improved contour detection (which also uses depth information) and hand-crafted features; see the paper for details. They provide an experimental evaluation demonstrating improved proposal generation on the NYU Depth Dataset V2.
One of the main contributions regarding object detection lies in a smart representation of the depth channel. Instead of providing a convolutional neural network with the raw RGB-D channels, they encode depth as so-called HHA. These three channels are the horizontal disparity, the height above ground and the angle between the local surface normal and the gravity direction. The channels are linearly scaled to $[0,255]$. Then the regular R-CNN architecture is applied (using AlexNet pre-trained on ImageNet). For training, additional synthetic data is used.
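To make the scaling step concrete, here is a minimal sketch in Python/NumPy. It assumes the three HHA channels have already been computed and only illustrates the linear mapping to $[0,255]$. The function names `scale_channel` and `encode_hha` are hypothetical, and using each channel's own minimum and maximum as the default range is my assumption; the paper may use fixed ranges instead.

```python
import numpy as np


def scale_channel(channel, lo=None, hi=None):
    """Linearly map values from [lo, hi] to [0, 255] and return uint8.

    Assumption: if no range is given, the channel's own min/max is used.
    """
    channel = np.asarray(channel, dtype=np.float64)
    lo = channel.min() if lo is None else lo
    hi = channel.max() if hi is None else hi
    if hi <= lo:
        # Constant channel: nothing to scale, return zeros.
        return np.zeros(channel.shape, dtype=np.uint8)
    scaled = (channel - lo) / (hi - lo) * 255.0
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


def encode_hha(disparity, height, angle):
    """Stack the three scaled channels into an H x W x 3 image.

    The inputs are assumed to be precomputed H x W arrays.
    """
    channels = [scale_channel(c) for c in (disparity, height, angle)]
    return np.stack(channels, axis=-1)
```

The resulting three-channel `uint8` image has the same layout as an RGB image, which is what allows a network pre-trained on ImageNet to be reused without changing its input layer.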
For instance segmentation, their approach is very simple. The content of each detected bounding box is warped to a $50 \times 50$ grid. Each of these grid locations is then classified as foreground or background using a random forest to obtain an instance segmentation. Finally, the predictions are averaged within superpixels.
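The two mechanical steps, warping a box to a fixed grid and averaging predictions within superpixels, can be sketched as follows. This is only an illustration under my own assumptions: the function names are hypothetical, boxes are given as `(x0, y0, x1, y1)` with exclusive upper bounds, nearest-neighbor sampling stands in for whatever warping the authors use, and the random forest itself is omitted (its per-pixel foreground probabilities are taken as given).

```python
import numpy as np


def warp_to_grid(image, box, size=50):
    """Resample the content of a bounding box to a size x size grid.

    Assumption: box = (x0, y0, x1, y1) with exclusive x1, y1, and
    nearest-neighbor sampling is used.
    """
    x0, y0, x1, y1 = box
    ys = np.rint(np.linspace(y0, y1 - 1, size)).astype(int)
    xs = np.rint(np.linspace(x0, x1 - 1, size)).astype(int)
    # np.ix_ builds an index grid over the first two axes;
    # any trailing channel axis is kept as-is.
    return np.asarray(image)[np.ix_(ys, xs)]


def average_within_superpixels(prob, labels):
    """Replace each pixel's probability by the mean over its superpixel.

    prob:   H x W array of per-pixel foreground probabilities.
    labels: H x W array of non-negative integer superpixel ids.
    """
    prob = np.asarray(prob, dtype=np.float64)
    labels = np.asarray(labels)
    flat = labels.ravel()
    sums = np.bincount(flat, weights=prob.ravel())
    counts = np.bincount(flat)
    means = sums / np.maximum(counts, 1)  # guard against unused ids
    return means[labels]
```

Averaging within superpixels snaps the coarse per-location predictions to image boundaries, so the final mask follows object contours more closely than the grid alone would allow.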
Unfortunately, the approach seems heavily engineered. In contrast to recent improvements of the R-CNN object detector, the system is not, and cannot be, trained end-to-end. End-to-end training would, however, be an interesting direction for instance segmentation.
Qualitative results are shown in Figure 1.
Figure 1: Qualitative results of true positives and false positives (last two columns) and the inferred instance segmentations.