Building on object detections and instance segmentations from prior work, Gupta et al. propose a convolutional neural network for coarse pose prediction, which serves as the basis for fitting CAD models to improve scene understanding. Their approach, summarized in Figure 1, has three components: an instance segmenter, a convolutional neural network trained for pose estimation, and a CAD model fitting step based on an adapted Iterative Closest Point (ICP) algorithm.
Figure 1: Illustration of the proposed approach, where the first part (instance segmentation) is taken from prior work.
The convolutional neural network consists of three convolutional layers followed by pooling, dropout, local response normalization and ReLU activations. It takes 3-channel surface normals as input (each channel encoding the angle between the normal and one of the axes) and outputs a binned pose estimate. The model is trained on synthetic data from ModelNet. To generate positive samples, bounding boxes with an overlap of 0.7 are sampled randomly and their content is warped.
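The two quantities mentioned above, a binned pose and a bounding-box overlap, can be illustrated with a minimal sketch. It assumes that the pose is a single azimuth angle discretized into equal-width bins and that "overlap" means intersection over union; neither is confirmed by the summary. The function names and the default of 8 bins are hypothetical.

```python
def azimuth_to_bin(angle_deg, num_bins=8):
    """Map an azimuth angle (degrees) to one of num_bins equal-width bins.

    Assumption: the pose is a single rotation angle; num_bins=8 is illustrative.
    """
    width = 360.0 / num_bins
    return int((angle_deg % 360.0) // width)


def bin_to_azimuth(bin_index, num_bins=8):
    """Return the centre angle (degrees) of a bin, usable as a coarse pose."""
    width = 360.0 / num_bins
    return (bin_index + 0.5) * width


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Assumption: the 'overlap' used for sampling is measured as IoU.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under these assumptions, a randomly sampled box would be compared against the reference box with `box_iou`, and the network's predicted bin would be turned back into a coarse angle with `bin_to_azimuth` before model fitting.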
Given the instance segmentation and a coarse pose estimate, a search over different models and scales is used to infer the optimal scale, rotation and translation for each model. As a cue for the scale, they use the area of the top view of the bounding box: during the search, models are scaled to fit this area. For each model at the selected scale, ICP is used to solve for the rotation and translation. The translation is initialized by assuming that the object stands on the floor and by using the mean of the instance segmentation. The rotation is initialized according to the coarse pose estimate. Because multiple models are fitted, a linear classifier is learned to select the best candidate.
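The scale selection and alignment steps can be sketched as follows. This is a standard point-to-point ICP with a Kabsch update and brute-force nearest neighbours, not the adapted variant used by the authors. It assumes that z is the vertical axis, so the top view is the x-y plane, and that models are scaled uniformly. All function names are hypothetical.

```python
import numpy as np


def scale_to_top_view_area(model_points, target_area):
    """Uniformly scale a model so its top-view bounding-box area matches target_area.

    Assumption: z is the vertical axis, so the top view is the x-y plane.
    """
    extent = model_points.max(axis=0) - model_points.min(axis=0)
    model_area = extent[0] * extent[1]
    return model_points * np.sqrt(target_area / model_area)


def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t approximates dst[i] (Kabsch)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


def icp(model, scene, R_init, t_init, iterations=20):
    """Point-to-point ICP aligning model (N x 3) to scene (M x 3)."""
    R, t = R_init, t_init
    for _ in range(iterations):
        moved = model @ R.T + t
        # Brute-force nearest neighbour in the scene for every model point.
        d2 = ((moved[:, None, :] - scene[None, :, :]) ** 2).sum(axis=2)
        nearest = scene[d2.argmin(axis=1)]
        # Re-estimate the full transform from the original model points.
        R, t = best_rigid_transform(model, nearest)
    return R, t
```

In the setting described above, `R_init` would come from the coarse pose estimate and `t_init` from the floor assumption and the mean of the segmented points. Running `icp` once per candidate model and scale yields the set of fits from which the best one is then selected.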
Qualitative results on the NYU Depth V2 dataset are shown in Figure 2.