

David Eigen, Rob Fergus. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. ICCV, 2015.

Eigen and Fergus propose to use the same multi-scale network architecture, combined with task-specific losses, for three tasks: surface normal estimation, depth estimation and semantic segmentation. They also experiment with sharing the coarse-scale network components between depth and surface normal estimation. However, they do not present further experiments on jointly tackling any of the three tasks.
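The coarse-to-fine idea behind the multi-scale architecture can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `scale1`, `scale2` and `scale3` callables are hypothetical stand-ins for the actual convolutional stacks, and nearest-neighbor upsampling replaces the paper's learned upsampling.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbor upsampling; placeholder for learned upsampling
    return x.repeat(2, axis=0).repeat(2, axis=1)

def multiscale_forward(image, scale1, scale2, scale3):
    # Scale 1: coarse, global prediction (AlexNet/VGG-initialized in the paper)
    coarse = scale1(image)
    # Scale 2: mid-resolution refinement, conditioned on the coarse output
    mid = scale2(image, upsample2x(coarse))
    # Scale 3: further refinement at higher resolution
    fine = scale3(image, upsample2x(mid))
    return fine
```

Depending on the task, only the loss applied to the final output changes; the pipeline itself stays the same.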

Figure 1: Illustration of the architecture used and exact architecture details.

The network architecture is illustrated in Figure 1, including details on the three scales. Scale 1 is initialized from a pre-trained AlexNet or VGG model, while the other scales are initialized randomly. Scales 1 and 2 can further be trained simultaneously; Scale 3 is trained separately, on downscaled input, to speed up training. For depth estimation,

$\mathcal{L}(D, D^*) = \frac{1}{n} \sum_i d_i^2 - \frac{1}{2n^2} \left(\sum_i d_i\right)^2 + \frac{1}{n} \sum_i [(\nabla_x d_i)^2 + (\nabla_y d_i)^2]$

is used as the loss, where $d_i$ denotes the per-pixel difference between predicted and ground-truth log depth. The third term additionally encourages the estimate to match the local structure of the ground truth. This idea might also be useful for other tasks, but its applicability to, for example, semantic segmentation is not explored. For surface normal estimation, the (negative) dot product between predicted and ground-truth normals is used as loss; for semantic segmentation, the cross-entropy loss. Interestingly, the training hyper-parameters are mostly shared across tasks (for details see the paper).
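The depth loss can be written down directly. The following NumPy sketch assumes `pred_log` and `gt_log` are log-depth maps, so that $d_i$ is simply their per-pixel difference, and uses simple forward differences for the gradient terms (the paper's exact gradient computation may differ):

```python
import numpy as np

def depth_loss(pred_log, gt_log):
    """Scale-invariant depth loss with gradient-matching term (sketch)."""
    d = pred_log - gt_log               # per-pixel log-depth difference d_i
    n = d.size
    term1 = np.sum(d**2) / n            # (1/n) sum_i d_i^2
    term2 = np.sum(d)**2 / (2 * n**2)   # (1/(2n^2)) (sum_i d_i)^2
    # forward differences as stand-ins for nabla_x d_i, nabla_y d_i
    grad_x = np.diff(d, axis=1)
    grad_y = np.diff(d, axis=0)
    term3 = (np.sum(grad_x**2) + np.sum(grad_y**2)) / n
    return term1 - term2 + term3
```

For a prediction that differs from the ground truth by a constant factor in log space, the gradient term vanishes and the loss reduces to $c^2/2$ for log offset $c$, which illustrates the partial scale invariance of the first two terms.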

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me.