Novotny et al. introduce VpDR-Net, a network for joint viewpoint and depth estimation as well as shape completion, where completion is performed by predicting completed point clouds. The high-level approach is shown in Figure 1. For details on the individual networks, I recommend reading the supplementary material of the paper. The proposed network consists of three components: a viewpoint estimation network, a depth estimation network, and a point cloud completion network.
An important aspect is the input to the network. As illustrated in Figure 1, given a sequence of images (i.e., a video) of an object, structure-from-motion is used to estimate a 3D point cloud reconstruction. If RGBD frames are available, KinectFusion is used instead of structure-from-motion. Thus, the network receives as input the images including estimated depth, the estimated point cloud (for supervision), and the estimated egomotion.
The viewpoint estimation network shares its weights with the depth estimation network. The architecture is a siamese network in which each of the two streams follows the ResNet50 architecture, followed by a hypercolumn descriptor module which directly predicts depth and viewpoint. The input to the siamese network consists of only two images taken from different frames. One difficulty in predicting viewpoints is that no absolute viewpoints are available for training; instead, viewpoints are only valid within individual sequences. However, Novotny et al. intend to predict the absolute, sequence-independent viewpoint. Therefore, they supervise the network with relative viewpoint changes between frames within sequences; details can be found in the paper. For both the viewpoint and the depth prediction, they introduce an uncertainty measure. In particular, they let the network predict both the viewpoint and the associated uncertainty, encoded by a variance. Translation is then modeled by a Gaussian, such that the negative log-likelihood can be optimized; rotation is modeled as Laplacian. Note that the negative log-likelihoods then depend on both the predicted variance and the error between prediction and ground truth. For depth, a similar approach is taken: depth is also modeled as Laplacian.
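To make the role of the predicted uncertainty more concrete, the following is a minimal sketch of the two negative log-likelihoods, not the authors' implementation. It assumes independent dimensions, a single scalar uncertainty per prediction, and a parameterization by the standard deviation `sigma` (Gaussian) and the scale `b` (Laplace) rather than the variance directly; the function names are hypothetical.

```python
import math


def gaussian_nll(pred, target, sigma):
    """Negative log-likelihood of `target` under N(pred, sigma^2).

    Assumption: independent dimensions sharing one predicted sigma.
    """
    return sum(
        0.5 * math.log(2.0 * math.pi * sigma ** 2)
        + (t - p) ** 2 / (2.0 * sigma ** 2)
        for p, t in zip(pred, target)
    )


def laplace_nll(pred, target, b):
    """Negative log-likelihood of `target` under Laplace(pred, b).

    Assumption: independent dimensions sharing one predicted scale b
    (the variance of a Laplace distribution is 2 * b^2).
    """
    return sum(
        math.log(2.0 * b) + abs(t - p) / b
        for p, t in zip(pred, target)
    )


# Illustrative values only: a translation prediction that is off by 3 units.
confident = gaussian_nll([0.0], [3.0], sigma=1.0)
uncertain = gaussian_nll([0.0], [3.0], sigma=3.0)
```

In this sketch, `uncertain` is smaller than `confident`: for a large error, predicting a larger uncertainty reduces the loss, while the logarithmic term penalizes predicting a large uncertainty everywhere. This is the trade-off described above between the predicted variance and the error.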
Finally, the point cloud completion network is based on PointNet. In particular, it takes as input a partial point cloud obtained from the predicted viewpoints and depth maps, and predicts $M$ 3D points. These points should fit the ground truth point cloud as closely as possible, i.e., the network is trained by minimizing the sum of squared errors between nearest neighbors. Based on the 3D points, they also predict a probability distribution: for each point, a scalar is predicted indicating the fraction of ground truth points lying closest to that point. The point cloud completion network is depicted in Figure 2.
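The nearest-neighbor loss and the per-point fractions can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the loss from the paper: it assumes that each ground truth point is matched to its closest predicted point, and the names `sq_dist`, `nearest_index` and `completion_targets` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, candidates):
    """Index of the candidate closest to `point`."""
    return min(range(len(candidates)), key=lambda j: sq_dist(point, candidates[j]))


def completion_targets(predicted, ground_truth):
    """Return (sum of squared nearest-neighbor errors, per-point fractions).

    Assumption: every ground truth point is assigned to its nearest
    predicted point; the fraction for a predicted point is the share of
    ground truth points assigned to it.
    """
    counts = [0] * len(predicted)
    loss = 0.0
    for g in ground_truth:
        j = nearest_index(g, predicted)
        loss += sq_dist(g, predicted[j])
        counts[j] += 1
    fractions = [c / len(ground_truth) for c in counts]
    return loss, fractions


# Illustrative values only: M = 2 predicted points, 4 ground truth points.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.1), (0.9, 0.0, 0.0), (1.0, 0.1, 0.0), (1.0, 0.0, 0.0)]
loss, fractions = completion_targets(predicted, ground_truth)
```

Here one ground truth point lies closest to the first predicted point and three lie closest to the second, so the fractions sum to one and can serve as the target for the predicted probability distribution. A practical implementation would replace the brute-force search with a batched or tree-based nearest-neighbor query.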
The full system is trained in two stages: first, the viewpoint and depth networks are optimized jointly; then, the point cloud completion network is trained. Qualitative results of the approach can be seen in Figure 3. Note that at test time, the system can predict the shape (as a point cloud) and the viewpoint from a single image.
Figure 3: Qualitative results. Top row: input images; bottom rows: completed point cloud and predicted viewpoint seen from two different sides.