Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas A. Funkhouser. Semantic Scene Completion from a Single Depth Image. CVPR, 2017.

Song et al. use 3D convolutional networks for joint scene completion and semantic labeling; they call their approach the semantic scene completion network. In 3D, scene completion describes the task of predicting volumetric occupancy (e.g. in a voxel grid) from a given RGBD input image. Here, the task is tackled jointly with semantic labeling by also predicting a category label for each voxel. A high-level overview of the approach is given in Figure 1. In their paper, they address the following problems: the data representation, a network architecture able to leverage scene context, and the generation of synthetic training data.

Figure 1: High-level overview of the proposed semantic scene completion network. The RGBD input image and the corresponding point cloud are fed through a network consisting of several convolutional layers, including skip connections and newly introduced dilated convolutional layers. Overall, the goal is to increase the size of the receptive field and to learn context. The output is a (slightly smaller) voxel grid containing the semantic scene completion labeling.

As data representation, they propose a flipped version of the Truncated Signed Distance Function (TSDF) to avoid strong gradients in empty space along occlusion boundaries. Given the signed distance $d$ to the closest surface and a maximum distance $d_{\text{max}}$, it is computed as

$d_{\text{flipped}} = \text{sign}(d)(d_{\text{max}} - d)$.
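To make the flipped encoding concrete, here is a minimal NumPy sketch. It rests on several assumptions that are not spelled out above: the subtracted term is taken to be the distance magnitude $|d|$ (otherwise negative distances would yield magnitudes larger than $d_{\text{max}}$), distances are truncated to $[-d_{\text{max}}, d_{\text{max}}]$ first, and a distance of exactly zero is treated as positive. The function name `flipped_tsdf` is hypothetical and not taken from the paper.

```python
import numpy as np

def flipped_tsdf(d, d_max):
    """Flip truncated signed distances so that magnitudes peak at the surface.

    Assumptions (not from the paper): |d| is used in the difference,
    distances are clipped to [-d_max, d_max], and d == 0 counts as positive.
    """
    d = np.clip(np.asarray(d, dtype=np.float64), -d_max, d_max)
    sign = np.where(d >= 0, 1.0, -1.0)  # treat zero distance as positive
    return sign * (d_max - np.abs(d))

# Distances in arbitrary units with a truncation distance of 1.
distances = np.array([0.0, 0.25, 1.0, -0.25, 2.0])
flipped = flipped_tsdf(distances, d_max=1.0)
```

Under these assumptions, the largest magnitudes sit right next to the surface and the values fade to zero at the truncation distance, i.e. in empty space far away from any surface.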

The network follows a regular convolutional architecture with additional dilated convolutions [35] and skip connections, which fuse information from multiple scales and increase the receptive field. The architecture is shown in Figure 1. Note that the pooling layers only reduce the size to one fourth of the original size; the network then directly predicts the voxel labeling for the whole scene at this reduced resolution. The input volumes are rotated to align with the gravity direction, and the rooms are oriented to match the Manhattan world assumption. The input volumes are of size $240 \times 144 \times 240$. A voxel-wise softmax loss is used for training.
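The following NumPy sketch illustrates three of the quantities mentioned above; it is not the authors' implementation. It assumes that the reduction to one fourth applies per axis, that the loss is the mean cross-entropy over all voxels, and that logits are laid out as (classes, depth, height, width). All function names and the number of classes used in the example are hypothetical.

```python
import numpy as np

def dilated_kernel_extent(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel along a single axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def output_grid(input_shape, factor=4):
    """Output resolution, assuming the reduction factor applies per axis."""
    return tuple(s // factor for s in input_shape)

def voxelwise_softmax_loss(logits, labels):
    """Mean softmax cross-entropy over all voxels.

    logits: float array of shape (C, D, H, W), one score per class and voxel.
    labels: integer array of shape (D, H, W) with values in [0, C).
    """
    # Subtract the per-voxel maximum for numerical stability.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Select the log-probability of the true class at every voxel.
    picked = np.take_along_axis(log_probs, labels[None], axis=0)[0]
    return -picked.mean()

# A 3-tap kernel with dilation 2 spans 5 voxels without adding parameters.
extent = dilated_kernel_extent(3, 2)
# Input volume size taken from the text above.
grid = output_grid((240, 144, 240))
# Tiny example with 4 hypothetical classes and uniform scores.
logits = np.zeros((4, 2, 3, 2))
labels = np.zeros((2, 3, 2), dtype=np.int64)
loss = voxelwise_softmax_loss(logits, labels)
```

The first helper shows why dilation is attractive here: the receptive field grows with the dilation rate while the number of weights per kernel stays the same.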

For training data, they present SUNCG, a large-scale dataset consisting of synthetic, labeled scenes built using an interior design planner (specifically, Planner5D [25]). Using Planner5D, they created multi-room layouts of apartments in which all objects have been manually labeled with category labels. For training, depth maps were generated by rendering the scenes from different viewpoints while incorporating typical characteristics of Microsoft’s Kinect.

On the NYU Depth V2 dataset [29], they present quantitative and qualitative results. For a quantitative comparison to other approaches, refer to the paper. Figure 2 shows qualitative results of their approach. From their experiments, they draw several conclusions: scene completion and bigger receptive fields help recognize objects, i.e. they improve the quality of the semantic labeling. Conversely, semantic information and scene context also help scene completion. They also demonstrate the effectiveness of their multi-scale network as well as the advantage of the proposed flipped Truncated Signed Distance Function.

Figure 2: Qualitative results showing the input depth map and color image, the corresponding voxel grid, the ground truth (which was obtained by fitting CAD models to the NYU Depth V2 dataset), the prediction and the error.

  • [35] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • [25] Planner5D. https://planner5d.com/.
  • [29] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.