David Eigen, Rob Fergus. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. ICCV, 2015.

Eigen and Fergus propose to use the same multi-scale network architecture with different losses for three tasks: surface normal estimation, depth estimation and semantic segmentation. They also experiment with sharing the lower-scale network components between depth and surface normal estimation. However, they do not present further experiments on jointly tackling any of the three tasks.

Figure 1: Illustration of the architecture used and exact architecture details.

The network architecture is illustrated in Figure 1, including details on the three scales. Scale 1 is initialized from the AlexNet or VGG model, while the other scales are initialized randomly. Scales 1 and 2 can further be trained simultaneously; Scale 3 is trained separately, on downscaled input, to speed up training. For depth estimation,

$\mathcal{L}(D, D^*) = \frac{1}{n} \sum_i d_i^2 - \frac{1}{2n^2} \left(\sum_i d_i\right)^2 + \frac{1}{n} \sum_i [(\nabla_x d_i)^2 + (\nabla_y d_i)^2]$

is used as loss, where $d_i = \log D_i - \log D_i^*$ is the per-pixel difference between predicted and ground truth log depth. The third term additionally encourages similar local structure between ground truth and estimation. This might also be useful for other tasks, but they do not experiment with the applicability of this idea to semantic segmentation. For surface normal estimation, the (negative) dot product between predicted and ground truth normals is used as loss, and for semantic segmentation the cross-entropy loss. Interestingly, hyper-parameters for training are mostly similar across tasks (for details see the paper).
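The depth loss above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (function and variable names are not from the paper; gradients are approximated by forward differences, and inputs are assumed to be log-depth maps):

```python
import numpy as np

def depth_loss(pred_log_depth, gt_log_depth):
    """Scale-invariant depth loss with first-order gradient matching.

    A sketch of the loss L(D, D*) discussed above; both inputs are
    2D arrays holding log depth.
    """
    d = pred_log_depth - gt_log_depth  # per-pixel log-depth difference
    n = d.size
    # Forward differences approximate the horizontal/vertical gradients.
    grad_x = np.diff(d, axis=1)
    grad_y = np.diff(d, axis=0)
    return ((d ** 2).sum() / n
            - (d.sum() ** 2) / (2 * n ** 2)
            + ((grad_x ** 2).sum() + (grad_y ** 2).sum()) / n)
```

Note the scale-invariance: a prediction that is off by a constant in log depth (i.e., by a global scale factor in depth) has zero gradient terms and only incurs the reduced penalty $c^2 - c^2/2 = c^2/2$ instead of the full $c^2$.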

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.