Master Thesis Proposal “Shape Completion from Bounding Boxes using CAD Shape Priors”

Part of my master thesis at the Max Planck Institute for Intelligent Systems was an initial proposal outlining the general idea and the current state of the art. Specifically, I worked on learning 3D shape completion on KITTI using 3D bounding boxes only. In this article, I want to present this proposal.

In January this year I started my master thesis, advised by Prof. Andreas Geiger from the Max Planck Institute for Intelligent Systems and Prof. Bastian Leibe from RWTH Aachen University. Initially, I worked on OctNet [5], a data structure for efficient deep learning in 3D. For example, I implemented batch normalization for OctNets in C++/CUDA, which I intend to publish on GitHub soon. For the master thesis itself, however, I concentrated on learning shape completion without ground truth, e.g. on KITTI [6].

The master thesis is entitled:

Learning Shape Completion from Bounding Boxes with CAD Shape Priors


Deep Learning in 3D using OctNets. Recent convolutional neural networks (CNNs) are reported to surpass human-level performance on some recognition tasks [1, 2, 3, 4] and are therefore receiving considerable attention. However, generalizing CNNs to (sparse) 3D data has proven difficult, at least at high resolutions. In 3D applications, for example on the ModelNet 3D shape classification benchmark [7], researchers have resorted to shallower 3D CNNs trained at low resolution [8, 9, 11]. Furthermore, 3D CNNs have been found to perform worse than CNNs trained on 2D projections of the shapes [10]. The hypothesis that this under-performance of 3D CNNs is due to the low resolution was recently addressed in [5]. In particular, Riegler et al. propose a hybrid grid-octree data structure to represent sparse 3D data, illustrated in Figure 1a, which enables efficient training of so-called OctNets at high resolutions.
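To give a rough feeling for why a hybrid grid-octree helps on sparse data, the following minimal sketch counts how many cells are needed to represent a boolean occupancy grid as a regular grid of shallow octrees, compared to a dense grid. This is only an illustration and not the OctNet implementation from [5]: the function names, the block size of 8 voxels (octrees of depth three) and the toy input are assumptions made for this example.

```python
import numpy as np


def count_octree_leaves(block):
    """Number of leaf cells needed to represent a cubic boolean block exactly."""
    # A homogeneous block (all free or all occupied) is stored as a single leaf.
    if block.all() or not block.any():
        return 1
    h = block.shape[0] // 2
    # Otherwise the block is split into its eight octants, which are handled recursively.
    return sum(
        count_octree_leaves(block[x:x + h, y:y + h, z:z + h])
        for x in (0, h) for y in (0, h) for z in (0, h)
    )


def count_hybrid_leaves(grid, block_size=8):
    """Total number of leaves in a regular grid of shallow octrees.

    Each octree covers block_size^3 voxels; block_size = 8 (depth three)
    is an assumption made for this illustration.
    """
    r = grid.shape[0]
    total = 0
    for x in range(0, r, block_size):
        for y in range(0, r, block_size):
            for z in range(0, r, block_size):
                block = grid[x:x + block_size, y:y + block_size, z:z + block_size]
                total += count_octree_leaves(block)
    return total


# Toy example: a 32^3 volume containing a single occupied voxel.
grid = np.zeros((32, 32, 32), dtype=bool)
grid[0, 0, 0] = True

dense_cells = grid.size
hybrid_cells = count_hybrid_leaves(grid)
print(dense_cells, hybrid_cells)
```

In this toy case, the 63 empty blocks collapse to one leaf each and only the block containing the occupied voxel is subdivided, so the hybrid structure needs far fewer cells than the dense grid.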

Shape Completion using CAD Shape Priors. Given a voxelized point cloud of an object (for example, cars in the LiDAR data of the KITTI dataset [6]), shape completion is the task of predicting a dense labeling of the volume that identifies the space occupied by the object. As the object is observed only partially, CAD models can be employed as a prior on the object’s shape. In [12], for example, CAD models from a hand-selected database are fitted to the object using the Iterative Closest Point (ICP) algorithm (see e.g. [18]). In [13], in contrast, the problem of selecting appropriate CAD models is avoided by learning a latent space of shapes, as illustrated in Figure 1b. Shape completion is then formulated as energy minimization over this latent space. In [14], the energy minimization is additionally guided by stereo and image information. Still, inference requires minimizing a (highly) non-linear energy for each object.
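The input to shape completion, as described above, is a voxelized point cloud. As a minimal sketch of what this could look like, the following snippet maps the points that fall inside an axis-aligned box onto a binary occupancy grid. The function name, the axis-aligned box and the toy points are assumptions for illustration; KITTI bounding boxes are oriented, so the points would first need to be transformed into the box coordinate frame.

```python
import numpy as np


def voxelize(points, box_min, box_max, resolution):
    """Map an (N, 3) point cloud to a boolean occupancy grid of shape (R, R, R)."""
    points = np.asarray(points, dtype=float)
    box_min = np.asarray(box_min, dtype=float)
    box_max = np.asarray(box_max, dtype=float)

    # Keep only the points that lie inside the (axis-aligned) box.
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    points = points[inside]

    # Scale coordinates to [0, R]; clip so that points on the upper faces
    # of the box fall into the last voxel instead of out of range.
    scaled = (points - box_min) / (box_max - box_min) * resolution
    idx = np.clip(scaled.astype(int), 0, resolution - 1)

    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid


# Toy example: two points inside the unit box, one point outside.
points = np.array([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9], [2.0, 0.0, 0.0]])
grid = voxelize(points, box_min=[0, 0, 0], box_max=[1, 1, 1], resolution=4)
print(grid.shape, int(grid.sum()))
```

The resulting grid marks only the observed surface points, which is why the dense occupancy has to be inferred with the help of a shape prior.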

End-to-End Learning of Shape Completion in 3D using Bounding Boxes. In this thesis, we want to formulate shape completion as a weakly supervised, end-to-end learning task using OctNets. Motivated by models such as [15], a latent shape prior is learned using deep auto-encoders or similar generative models [16, 17]. Given the voxelized point cloud as input, a second network, the inference network, is trained to directly predict the latent representation corresponding to the correct shape. Here, in contrast to energy minimization approaches, the shape is predicted in a single forward pass. While OctNets make learning high-resolution shape completion feasible, one major challenge is the missing ground truth shape data. Although CAD models could be fitted manually to provide ground truth for supervision, we intend to train the inference network without ground truth shapes, using bounding boxes only. Thus, given the bounding box as well as camera information, the inference network is ultimately trained in a weakly supervised fashion.
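The data flow described above can be sketched in a few lines. Note that everything concrete here is an assumption made for illustration and not the method of the thesis: the decoder and the inference network are single linear layers with random weights, and the loss is one possible way to use bounding boxes as weak supervision (observed voxels inside the box should be occupied, voxels outside the box should be free, all remaining voxels are left unconstrained). Camera information is not modeled.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode(z, W_dec, b_dec):
    """Hypothetical fixed decoder of the shape prior: latent code -> occupancy probabilities."""
    return sigmoid(W_dec @ z + b_dec)


def infer(x, W_inf):
    """Hypothetical inference network: flattened observation -> latent code."""
    return W_inf @ x


def weak_loss(pred, observed, box_mask, eps=1e-7):
    """Binary cross-entropy restricted to voxels whose label is known (an assumed loss)."""
    pred = np.clip(pred, eps, 1 - eps)
    occupied = observed & box_mask  # observed points inside the box: should be occupied
    free = ~box_mask                # voxels outside the box: should be free
    loss = -np.log(pred[occupied]).sum() - np.log(1 - pred[free]).sum()
    return loss / max(occupied.sum() + free.sum(), 1)


# Toy setup: a flattened 4^3 volume and a 5-dimensional latent space.
rng = np.random.default_rng(0)
n_voxels, n_latent = 4 ** 3, 5
W_dec = rng.normal(size=(n_voxels, n_latent))
b_dec = np.zeros(n_voxels)
W_inf = 0.1 * rng.normal(size=(n_latent, n_voxels))

box_mask = np.zeros(n_voxels, dtype=bool)
box_mask[:32] = True   # voxels covered by the bounding box
observed = np.zeros(n_voxels, dtype=bool)
observed[:8] = True    # voxels hit by observed points

# Single forward pass: observation -> latent code -> completed shape.
pred = decode(infer(observed.astype(float), W_inf), W_dec, b_dec)
loss = weak_loss(pred, observed, box_mask)
print(pred.shape, float(loss))
```

In such a setup, only the weights of the inference network would be updated while the decoder stays fixed, so every prediction is constrained to the learned shape space.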

  • [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015: 1026-1034.
  • [2] Karen Simonyan, Andrew Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
  • [3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich: Going deeper with convolutions. CVPR 2015: 1-9.
  • [4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012: 1106-1114.
  • [5] Gernot Riegler, Ali Osman Ulusoy, Andreas Geiger: OctNet: Learning Deep 3D Representations at High Resolutions. CoRR abs/1611.05009 (2016).
  • [6] Andreas Geiger, Philip Lenz, Christoph Stiller, Raquel Urtasun: Vision meets robotics: The KITTI dataset. I. J. Robotics Res. 32(11): 1231-1237 (2013).
  • [7] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, Jianxiong Xiao: 3D ShapeNets: A deep representation for volumetric shapes. CVPR 2015: 1912-1920.
  • [8] Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik G. Learned-Miller: Multi-view Convolutional Neural Networks for 3D Shape Recognition. ICCV 2015: 945-953.
  • [9] Daniel Maturana, Sebastian Scherer: VoxNet: A 3D Convolutional Neural Network for real-time object recognition. IROS 2015: 922-928.
  • [10] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, Leonidas J. Guibas: Volumetric and Multi-view CNNs for Object Classification on 3D Data. CVPR 2016: 5648-5656.
  • [11] Yangyan Li, Sören Pirk, Hao Su, Charles Ruizhongtai Qi, Leonidas J. Guibas: FPNN: Field Probing Neural Networks for 3D Data. NIPS 2016: 307-315.
  • [12] Saurabh Gupta, Pablo Andrés Arbeláez, Ross B. Girshick, Jitendra Malik: Aligning 3D models to RGB-D images of cluttered scenes. CVPR 2015: 4731-4740.
  • [13] Francis Engelmann, Jörg Stückler, Bastian Leibe: Joint Object Pose Estimation and Shape Reconstruction in Urban Street Scenes Using 3D Shape Priors. GCPR 2016: 219-230.
  • [14] Moritz Menze, Andreas Geiger: Object scene flow for autonomous vehicles. CVPR 2015: 3061-3070.
  • [15] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta: Learning a Predictable and Generative Vector Representation for Objects. ECCV (6) 2016: 484-499.
  • [16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, Yoshua Bengio: Generative Adversarial Networks. CoRR abs/1406.2661 (2014).
  • [17] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther: Autoencoding beyond pixels using a learned similarity metric. ICML 2016: 1558-1566.
  • [18] Paul J. Besl, Neil D. McKay: A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2): 239-256 (1992).