Kokkinos presents UberNet, a network trained end-to-end to jointly address boundary detection, surface normal estimation, saliency estimation, semantic segmentation, semantic boundary detection, human part labeling and object detection. In this sense (and to the best of my knowledge) it is the first network to tackle multiple low- and high-level vision tasks simultaneously. In the paper, Kokkinos mainly discusses two problems arising from the multi-task setting: training on diverse datasets, none of which provides annotations for all tasks, and training many tasks within a limited memory budget.
The architecture is fed the input image at three scales. A VGG-like network computes shared features, which are then fed into task-specific layers. Low-level features are passed to higher layers through skip connections. Activations across scales and layers are combined in fusion layers, which simply concatenate the features and apply a linear layer. The overall architecture is illustrated in the paper.
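To make the fusion step concrete, here is a minimal sketch in plain Python of what "concatenate, then apply a linear layer" means for a single pixel. The function name `fuse`, the example values and the two-channel layout are assumptions made for illustration; they are not taken from the paper, where the operation is applied to full feature maps.

```python
def fuse(features, weights, bias):
    """Concatenate per-source feature vectors and apply one linear layer.

    features: list of feature vectors, one per scale or layer (hypothetical layout)
    weights:  one row per output channel, each as long as the concatenation
    bias:     one value per output channel
    """
    concatenated = [x for source in features for x in source]
    if any(len(row) != len(concatenated) for row in weights):
        raise ValueError("each weight row must match the concatenated length")
    return [
        sum(w * x for w, x in zip(row, concatenated)) + b
        for row, b in zip(weights, bias)
    ]


# Three hypothetical sources (for example the three scales), two channels each.
scale_features = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]
weights = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # sums the first channel of every source
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],  # sums the second channel of every source
]
bias = [0.0, 0.5]
fused = fuse(scale_features, weights, bias)
```

Applied independently at every pixel of a feature map, such a linear layer over concatenated channels is equivalent to a 1x1 convolution.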
To overcome the first problem, the network is trained with a linear combination of task-specific losses, and the weights are updated asynchronously. Note that there are task-dependent weights for each task as well as shared weights. Whenever a training sample from any of the training sets is processed, the corresponding weight updates are stored and the counters for the affected tasks are incremented. The shared weights and the task-dependent weights are each updated whenever their corresponding counter reaches a chosen "effective" batch size. The second problem is more technical and is discussed in detail in the paper.
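The counter-based update scheme can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the class `AsyncUpdater` and its method names are hypothetical, every parameter group is reduced to a single scalar weight, the gradients are assumed to be already scaled by the corresponding loss weight, and the shared weights are treated as a group that is affected by every sample.

```python
class AsyncUpdater:
    """Accumulate gradients per parameter group and update each group
    only once its own counter reaches its effective batch size."""

    def __init__(self, params, batch_sizes, lr=0.1):
        self.params = dict(params)            # one scalar weight per group, for brevity
        self.batch_sizes = dict(batch_sizes)  # effective batch size per group
        self.lr = lr
        self.grad_sums = {name: 0.0 for name in self.params}
        self.counters = {name: 0 for name in self.params}

    def process_sample(self, grads):
        """grads maps every group affected by this sample to its gradient.

        Returns the names of the groups that were updated by this call.
        """
        updated = []
        for name, grad in grads.items():
            self.grad_sums[name] += grad
            self.counters[name] += 1
            if self.counters[name] >= self.batch_sizes[name]:
                # Apply the averaged stored update, then reset this group only.
                mean_grad = self.grad_sums[name] / self.counters[name]
                self.params[name] -= self.lr * mean_grad
                self.grad_sums[name] = 0.0
                self.counters[name] = 0
                updated.append(name)
        return updated


updater = AsyncUpdater(
    params={"shared": 1.0, "boundaries": 1.0, "normals": 1.0},
    batch_sizes={"shared": 2, "boundaries": 2, "normals": 2},
    lr=0.5,
)
# A sample from a dataset that only has boundary annotations.
first = updater.process_sample({"shared": 1.0, "boundaries": 1.0})
# A sample from a dataset that only has surface normal annotations.
second = updater.process_sample({"shared": 3.0, "normals": 2.0})
```

In this toy run, the shared group has seen two samples after the second call and is updated, while the two task-specific groups have each seen only one sample and keep waiting for more data from their own training sets.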