
JANUARY 2018

READING

Iasonas Kokkinos. UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory. CVPR, 2017.

Kokkinos presents UberNet, a network trained end-to-end to jointly address boundary detection, surface normal estimation, saliency, semantic segmentation, semantic boundaries, human part labeling and object detection. In this sense (and to the best of my knowledge), it is the first network to tackle multiple low-, mid- and high-level vision tasks simultaneously. In the paper, Kokkinos mainly discusses two problems arising in this multi-task setting:

  • Training the network on several datasets of different sizes, each of which provides ground truth for only a subset of the tasks.
  • Training with a limited memory budget.

The network is fed the input image at three scales. A VGG-like architecture computes shared features, which are then fed into task-specific layers. Low-level features are passed to higher layers through skip connections. Activations across scales and layers are combined in fusion layers, which simply concatenate the features and apply a linear layer. The overall architecture can be found in the paper.
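To make the fusion step concrete, here is a minimal sketch for a single spatial location. It assumes the features from the different scales and layers have already been brought to a common resolution; the function name `fuse`, the toy weights and the feature sizes are my own illustrative choices and are not taken from the paper.

```python
from typing import List, Sequence


def fuse(features: Sequence[Sequence[float]],
         weights: Sequence[Sequence[float]],
         bias: Sequence[float]) -> List[float]:
    """Concatenate feature vectors and apply a linear layer.

    `features` holds one vector per scale/layer for a single pixel
    (assumption: all inputs were resampled to the same resolution).
    """
    # Concatenation: stack all feature vectors into one long vector.
    concatenated = [value for feature in features for value in feature]

    if len(bias) != len(weights):
        raise ValueError("need one bias per output channel")
    if any(len(row) != len(concatenated) for row in weights):
        raise ValueError("weight rows must match the concatenated length")

    # Linear layer: one dot product per output channel, plus bias.
    return [sum(w * x for w, x in zip(row, concatenated)) + b
            for row, b in zip(weights, bias)]


# Toy example: three inputs (e.g. three scales) with 2, 1 and 2 channels.
scale_features = [[1.0, 2.0], [3.0], [4.0, 5.0]]
weights = [
    [1.0, 0.0, 0.0, 0.0, 1.0],  # output 0: first plus last channel
    [0.5, 0.5, 0.5, 0.5, 0.5],  # output 1: half the sum of all channels
]
bias = [0.0, 1.0]

fused = fuse(scale_features, weights, bias)
print(fused)
```

Applied independently at every pixel, this is the same operation as a 1x1 convolution over the concatenated feature maps.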

To overcome the first problem, the network uses a linear combination of task-specific losses, and the weights are updated asynchronously. Note that there are task-dependent weights for each task as well as the shared weights. Whenever a training sample from an arbitrary training set is processed, the corresponding weight updates are stored and the counters for the affected tasks are incremented. The shared weights as well as the task-dependent weights are updated whenever the corresponding counter reaches a chosen "effective" batch size. The second problem is more technical and is discussed in detail in the paper.
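The counter-based update scheme can be sketched as follows. This is a toy illustration under my own assumptions, not the paper's implementation: each weight group is a single scalar, stored gradients are averaged before a plain gradient descent step, and the class name `AccumulatingUpdater`, the task names and the gradient values are hypothetical.

```python
from typing import Dict, List


class AccumulatingUpdater:
    """Stores per-group gradients and applies them once enough samples arrived."""

    def __init__(self, effective_batch_size: Dict[str, int],
                 learning_rate: float = 0.1) -> None:
        # One entry per weight group: "shared" plus one per task.
        self.batch_size = dict(effective_batch_size)
        self.lr = learning_rate
        self.params = {name: 0.0 for name in self.batch_size}
        self.grad_sums = {name: 0.0 for name in self.batch_size}
        self.counters = {name: 0 for name in self.batch_size}

    def process_sample(self, shared_grad: float,
                       task_grads: Dict[str, float]) -> List[str]:
        """Accumulate one sample; return the groups that were updated.

        `task_grads` only contains the tasks for which this sample
        provides ground truth.
        """
        updated = []
        grads = {"shared": shared_grad, **task_grads}
        for name, grad in grads.items():
            self.grad_sums[name] += grad
            self.counters[name] += 1
            # Update only when the counter reaches the effective batch size.
            if self.counters[name] >= self.batch_size[name]:
                mean_grad = self.grad_sums[name] / self.counters[name]
                self.params[name] -= self.lr * mean_grad
                self.grad_sums[name] = 0.0
                self.counters[name] = 0
                updated.append(name)
        return updated


updater = AccumulatingUpdater({"shared": 2, "segmentation": 2, "normals": 2})

# Three samples from different datasets, each annotated for one task only.
log = [
    updater.process_sample(1.0, {"segmentation": 1.0}),
    updater.process_sample(3.0, {"normals": 2.0}),
    updater.process_sample(1.0, {"segmentation": 3.0}),
]
print(log)
print(updater.params)
```

In this toy run the shared weights are updated after the second sample, while the segmentation weights have to wait for the third sample, because the second one carries no segmentation ground truth.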

Figure 1: Qualitative results.
