Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, Thomas A. Funkhouser. 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. CVPR, 2017.

Zeng et al. introduce 3DMatch, a 3D convolutional neural network for matching local 3D keypoint descriptors. From my viewpoint, being mostly interested in their general approach of applying 3D convolutional neural networks, the network architecture is quite simple; see Figure 1. A feature network, consisting of several 3D convolutional layers with ReLU activations and a pooling layer, learns a descriptor for each input volume. A metric learning network on top takes a pair of descriptors and produces two outputs corresponding to “match” and “non-match”.

Figure 1: Network architecture used for feature and metric learning.
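To make the feature/metric split concrete, here is a minimal NumPy mock-up of such a two-stage network. This is not the authors' implementation: the layer counts, channel sizes, and random weights are all assumptions for illustration; only the overall structure (3D convolutions with ReLU, a pooling layer, and a two-output metric head) follows the description above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

def conv3d(x, w):
    """Valid 3D convolution; x: (C, D, H, W), w: (K, C, k, k, k)."""
    k = w.shape[-1]
    # Sliding windows over the three spatial axes: (C, D', H', W', k, k, k).
    win = sliding_window_view(x, (k, k, k), axis=(1, 2, 3))
    return np.einsum('cdhwijk,ocijk->odhw', win, w)

def relu(x):
    return np.maximum(x, 0.0)

def maxpool3d(x, p=2):
    """Non-overlapping max pooling; trims spatial dims to a multiple of p."""
    C, D, H, W = x.shape
    D, H, W = D - D % p, H - H % p, W - W % p
    x = x[:, :D, :H, :W].reshape(C, D // p, p, H // p, p, W // p, p)
    return x.max(axis=(2, 4, 6))

def feature_network(volume, weights):
    """Map a (1, 31, 31, 31) input volume to a descriptor vector."""
    x = relu(conv3d(volume, weights[0]))  # -> (8, 29, 29, 29)
    x = maxpool3d(x)                      # -> (8, 14, 14, 14)
    x = relu(conv3d(x, weights[1]))       # -> (16, 12, 12, 12)
    return x.reshape(-1)

def metric_network(f1, f2, W, b):
    """Softmax over two logits ("match", "non-match") for a descriptor pair."""
    logits = W @ np.concatenate([f1, f2]) + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical random weights, for illustration only.
weights = [0.01 * rng.standard_normal((8, 1, 3, 3, 3)),
           0.01 * rng.standard_normal((16, 8, 3, 3, 3))]
dim = 16 * 12 * 12 * 12
W, b = 0.01 * rng.standard_normal((2, 2 * dim)), np.zeros(2)

v1, v2 = rng.standard_normal((2, 1, 31, 31, 31))
p = metric_network(feature_network(v1, weights), feature_network(v2, weights), W, b)
```

In practice both input volumes are passed through the same (shared-weight) feature network, so descriptors from different views live in one embedding space, which is what the metric head then compares.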

Considerable effort must have gone into setting up the dataset and evaluation pipeline. They sample 3D keypoint correspondences from 3D Harris keypoints in scenes captured with commodity RGB-D sensors such as Microsoft’s Kinect or Asus’ Xtion. To obtain ground truth, different viewpoints and video trajectories of the same scene are aligned using recent reconstruction results [9]. This scheme allows them to generate a large dataset for learning. The 3D volumes are represented as truncated distance functions; after obtaining two matching (or non-matching) keypoints, $31 \times 31 \times 31$ voxel volumes are extracted and fed to the feature learning network. Each volume roughly corresponds to a 15cm vicinity of its keypoint.
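The truncated distance representation and patch extraction might look like the following sketch. The truncation value, voxel size, and toy scene are assumptions on my part, and the brute-force nearest-surface search is only practical for small illustrative grids, not real reconstructions:

```python
import numpy as np

def tdf(occupancy, truncation):
    """Truncated distance function: per-voxel Euclidean distance to the
    nearest surface voxel, clipped at `truncation` (in voxel units).
    Brute force over all surface voxels; only for small toy grids."""
    surface = np.argwhere(occupancy)                     # (M, 3) surface voxels
    grid = np.indices(occupancy.shape).reshape(3, -1).T  # (N, 3) all voxels
    d = np.linalg.norm(grid[:, None, :] - surface[None, :, :], axis=-1)
    return np.minimum(d.min(axis=1), truncation).reshape(occupancy.shape)

def extract_patch(volume, keypoint, size=31):
    """Extract a size^3 patch centered on a keypoint voxel index."""
    r = size // 2
    lo = [k - r for k in keypoint]
    if any(l < 0 or l + size > s for l, s in zip(lo, volume.shape)):
        raise ValueError("patch exceeds volume bounds")
    return volume[lo[0]:lo[0]+size, lo[1]:lo[1]+size, lo[2]:lo[2]+size]

# Toy scene: a line of surface voxels in a 40^3 grid.
occ = np.zeros((40, 40, 40), dtype=bool)
occ[:, 20, 20] = True
vol = tdf(occ, truncation=5.0)            # distances clipped at 5 voxels
patch = extract_patch(vol, (20, 20, 20))  # 31x31x31 network input
```

With a voxel size of roughly 1cm, a 31-voxel patch covers about a 15cm vicinity of the keypoint, matching the numbers quoted above.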

Figure 2: Visualization of the learned feature space using t-SNE [33].

  • [9] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, Christian Theobalt. BundleFusion: Real-Time Globally Consistent 3D Reconstruction Using On-the-Fly Surface Reintegration. ACM Trans. Graph. 36(3): 24:1-24:18 (2017).
  • [33] Laurens van der Maaten, Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.