Zeng et al. introduce 3DMatch, a 3D convolutional neural network for 3D keypoint matching. From my viewpoint (I am mostly interested in their general approach to 3D convolutional neural networks), the network architecture is quite simple; see Figure 1. A feature network, consisting of several 3D convolutional layers with ReLU activations and a pooling layer, learns the feature representations. The metric learning network on top learns two outputs corresponding to “match” and “non-match”.
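To make the structure concrete, here is a minimal NumPy sketch of a forward pass through such a network. It is not the authors' implementation: the number of layers, the channel counts, the kernel sizes, the single linear layer used as metric network, and the choice to apply the same feature weights to both volumes are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def conv3d(x, w, b):
    """Valid 3D convolution (cross-correlation, as in most deep learning libraries).
    x: (C_in, D, H, W), w: (C_out, C_in, k, k, k), b: (C_out,)."""
    k = w.shape[-1]
    # windows has shape (C_in, D-k+1, H-k+1, W-k+1, k, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(x, (k, k, k), axis=(1, 2, 3))
    return np.einsum("cdhwijk,ocijk->odhw", windows, w) + b[:, None, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def max_pool3d(x):
    """2x2x2 max pooling with stride 2; odd trailing voxels are cropped."""
    c, d, h, w = x.shape
    d, h, w = d - d % 2, h - h % 2, w - w % 2
    x = x[:, :d, :h, :w].reshape(c, d // 2, 2, h // 2, 2, w // 2, 2)
    return x.max(axis=(2, 4, 6))

def features(volume, params):
    """Feature network: conv -> ReLU -> pool -> conv -> ReLU (layer sizes assumed)."""
    x = volume[None]                                   # add a channel axis
    x = relu(conv3d(x, params["w1"], params["b1"]))    # (4, 29, 29, 29)
    x = max_pool3d(x)                                  # (4, 14, 14, 14)
    x = relu(conv3d(x, params["w2"], params["b2"]))    # (8, 12, 12, 12)
    return x.ravel()

def match_probabilities(vol_a, vol_b, params):
    """Metric network (assumed: one linear layer) with two outputs."""
    pair = np.concatenate([features(vol_a, params), features(vol_b, params)])
    logits = params["w_out"] @ pair + params["b_out"]
    exp = np.exp(logits - logits.max())                # numerically stable softmax
    return exp / exp.sum()                             # [p(match), p(non-match)]

rng = np.random.default_rng(0)
params = {  # random, untrained weights; all shapes are hypothetical
    "w1": rng.normal(0, 0.1, (4, 1, 3, 3, 3)), "b1": np.zeros(4),
    "w2": rng.normal(0, 0.1, (8, 4, 3, 3, 3)), "b2": np.zeros(8),
    "w_out": rng.normal(0, 0.01, (2, 2 * 8 * 12**3)), "b_out": np.zeros(2),
}
vol_a = rng.random((31, 31, 31))
vol_b = rng.random((31, 31, 31))
probs = match_probabilities(vol_a, vol_b, params)
print(probs.shape, probs.sum())
```

With trained weights, the first output could be read as the probability that the two volumes show the same 3D point; which output index means “match” is again just a convention chosen here.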
Considerable effort must have gone into setting up the dataset and evaluation pipeline. They sample 3D keypoint correspondences from 3D Harris points in 3D scenes captured, e.g., with Microsoft’s Kinect or Asus’ Xtion. To obtain ground truth, different viewpoints and video trajectories of the same scene are aligned using recent results in reconstruction. Using this scheme, they are able to generate a large dataset for learning. They use a truncated distance field representation for the 3D volumes; after obtaining two matching (or non-matching) keypoints, $31 \times 31 \times 31$ volumes are extracted and fed to the feature learning network. These volumes correspond roughly to a 15 cm vicinity of the keypoints.
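As a rough illustration of the volume extraction, the following NumPy sketch builds a $31 \times 31 \times 31$ truncated distance field around a keypoint from a set of surface points. Several things here are my assumptions and not taken from the paper: the voxel size of 1 cm (inferred from 31 voxels covering roughly 15 cm on each side of the keypoint), the truncation at 5 cm, the normalisation to $[0, 1]$, and the use of nearest-point distances to a point cloud as a stand-in for however the distance field is actually computed. The function name `tdf_patch` is hypothetical.

```python
import numpy as np

def tdf_patch(surface_points, keypoint, size=31, voxel=0.01, trunc=0.05):
    """Truncated distance field of shape (size, size, size), centred on keypoint.
    Values are 0 on the surface and 1 at or beyond the truncation distance.
    voxel and trunc are in metres and are assumed values."""
    half = size // 2
    offsets = (np.arange(size) - half) * voxel          # -0.15 m ... +0.15 m
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    centres = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + keypoint
    # Running minimum over surface points keeps memory use small.
    nearest = np.full(len(centres), np.inf)
    for p in surface_points:
        nearest = np.minimum(nearest, np.linalg.norm(centres - p, axis=1))
    return np.minimum(nearest, trunc).reshape(size, size, size) / trunc

# Toy scene: a flat surface (the plane z = 0) sampled every 1 cm.
xs = (np.arange(31) - 15) * 0.01
plane = np.array([[x, y, 0.0] for x in xs for y in xs])
keypoint = np.array([0.0, 0.0, 0.0])

patch = tdf_patch(plane, keypoint)
print(patch.shape)          # (31, 31, 31)
print(patch[15, 15, 15])    # 0.0: the centre voxel lies on the surface
print(patch[15, 15, 0])     # 1.0: 15 cm from the surface, so truncated
```

For real scans, a spatial index such as a k-d tree would be a more sensible way to find the nearest surface point than the brute-force loop used here.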