Zhang et al. propose an approach to 3D indoor scene understanding based on deep networks and learned scene templates. At a high level, the approach works as follows: Based on manually annotated category labels, a network is trained to classify scenes into sleeping room, office, lounging area and office&chair. For each of these categories, a template is learned by aligning the scenes according to the major object, e.g. bed or desk. For each scene category, the objects of each object category are clustered using k-means, and the top clusters represent the objects in the scene template, see Figure 0. The idea is that a new scene (at test time) is first aligned using a network that estimates rotation and translation; the scene templates are then used to detect and classify objects in the scene. This is illustrated in Figure 1.
Figure 1: High-level overview of the proposed approach. The scene templates used are shown in Figure 0. Note that the input is a 3D volume using the Truncated Signed Distance representation.
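To make the template learning step more concrete, here is a minimal sketch of how the objects of one category (e.g. nightstands) could be clustered after alignment, keeping only the most populated clusters as template slots. It is not the authors' implementation: the function names `kmeans` and `template_slots`, the farthest-point initialization, and the choice to cluster object centers only are all assumptions made for illustration.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on 3D points (assumes k <= len(points))."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign():
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster runs empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign()

def template_slots(points, k, top):
    """Return the centers of the `top` most populated clusters."""
    centers, labels = kmeans(points, k)
    return [centers[j] for j, _ in Counter(labels).most_common(top)]

# Hypothetical nightstand centers, expressed relative to the aligned bed.
nightstands = [
    (-1.0, 0.0, 0.0), (-1.1, 0.1, 0.0), (-0.9, -0.1, 0.0), (-1.0, 0.1, 0.0),
    (1.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.9, 0.1, 0.0),
    (0.0, 3.0, 0.0),  # a rare placement that should not become a slot
]
slots = template_slots(nightstands, k=3, top=2)
```

With this toy data, the two slots end up on either side of the bed, while the isolated placement is dropped because its cluster has too few members.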
Instead of discussing the individual steps in detail, I refer to the paper and only mention some interesting details. In experiments, Zhang et al. show the importance of bounding box regression, which is already known to work well for 2D object detection, in 3D. Furthermore, two other factors seem to boost performance significantly: scene classification, i.e. categorizing scenes according to major themes such as bedroom or living room, and estimating a common alignment of the scenes. The former might also be important for other object detection/scene understanding approaches, while the latter might be due to the templates used, which assume aligned scenes. Furthermore, the network architecture, as illustrated in Figure 2, is trained separately for each scene category. Still, context is induced through the "Scene Pathway" (see Figure 2), which corresponds to a network initially trained for scene classification and then fine-tuned for the individual scene categories. This is an interesting way of incorporating context features.
Figure 2: Proposed network architecture consisting of a pre-trained "Scene Pathway" network and a scene-specific "Object Pathway" network largely following ideas from .
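As a rough illustration of what bounding box regression means in 3D, the following sketch extends the parameterization commonly used in 2D detectors (center offsets normalized by the anchor size, plus log size ratios) to axis-aligned 3D boxes. This is an assumption for illustration; the paper may parameterize its regression targets differently, and the names `encode_box` and `decode_box` are hypothetical.

```python
import math

def encode_box(anchor, box):
    """Regression targets for `box` relative to `anchor`.

    Both are axis-aligned 3D boxes given as (cx, cy, cz, w, h, d).
    """
    offsets = [(box[i] - anchor[i]) / anchor[i + 3] for i in range(3)]
    scales = [math.log(box[i + 3] / anchor[i + 3]) for i in range(3)]
    return offsets + scales

def decode_box(anchor, deltas):
    """Inverse of encode_box: apply predicted deltas to the anchor."""
    centers = [anchor[i] + deltas[i] * anchor[i + 3] for i in range(3)]
    sizes = [anchor[i + 3] * math.exp(deltas[i + 3]) for i in range(3)]
    return centers + sizes

# A template slot acts as the anchor; the annotated object is the target.
anchor = (0.0, 0.0, 0.0, 2.0, 1.0, 2.0)
target = (0.5, 0.1, -0.2, 2.2, 0.9, 1.8)
deltas = encode_box(anchor, target)
recovered = decode_box(anchor, deltas)
```

In such a setup, the network would predict `deltas` for each template slot, and `decode_box` would turn the prediction back into a box in scene coordinates.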
Finally, I want to note that they heavily augment the SUNRGBD dataset by randomly replacing objects with CAD models from the ShapeNetCore dataset. This increases the training set size enough to train the presented models. Qualitative results of alignment as well as full scene understanding are shown in Figure 3.
Figure 3: Top: alignment results for sleeping area and lounging area. Bottom: results for full scene understanding.
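The augmentation by object replacement mentioned above could look roughly like the following sketch. The scene representation (a list of dictionaries), the replacement probability `replace_prob`, and the function name `augment_scene` are hypothetical; the paper's actual procedure, e.g. how CAD models are fitted and rendered into the volume, is not reproduced here.

```python
import random

def augment_scene(objects, cad_models, replace_prob=0.5, rng=None):
    """Return a copy of the scene with some objects swapped for CAD models.

    objects:    list of dicts with keys "category", "box" and "shape"
    cad_models: dict mapping a category to a list of CAD model ids
    """
    rng = rng or random.Random()
    augmented = []
    for obj in objects:
        candidates = cad_models.get(obj["category"], [])
        if candidates and rng.random() < replace_prob:
            # Keep the annotated box, swap only the geometry source.
            obj = {**obj, "shape": rng.choice(candidates)}
        augmented.append(obj)
    return augmented

scene = [
    {"category": "bed", "box": (0.0, 0.0, 0.0, 2.0, 0.5, 1.6), "shape": "scan"},
    {"category": "lamp", "box": (1.2, 0.6, 0.0, 0.2, 0.4, 0.2), "shape": "scan"},
]
cad = {"bed": ["bed_cad_01", "bed_cad_02"]}  # hypothetical model ids
always = augment_scene(scene, cad, replace_prob=1.0, rng=random.Random(0))
never = augment_scene(scene, cad, replace_prob=0.0, rng=random.Random(0))
```

Calling the function repeatedly with different random states yields many variants of the same annotated scene, which is the sense in which the training set grows.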