DAVIDSTUTZ

27th September 2017

Tobias Pohlen, Alexander Hermans, Markus Mathias, Bastian Leibe. Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. CoRR abs/1611.08323 (2016).

Pohlen et al. introduce Full-Resolution Residual Networks (FRRNs) for semantic segmentation of urban street scenes, e.g. on Cityscapes [1]. The proposed network architecture is based in large part on the idea of residual units [2]. The proposed full-resolution residual unit extends this idea based on the following observation: current state-of-the-art deep networks for semantic segmentation are often built on pre-trained models that excel at recognition but lack localization performance. Low-level features are crucial for good localization, but they are often neglected or lost in traditional architectures comprising several pooling stages. Therefore, Pohlen et al. propose a network architecture with two streams: a residual stream that successively applies full-resolution residual units to the full-resolution image, and a pooling stream that follows a traditional encoder/decoder architecture [3] with several pooling stages. The latter is supposed to learn high-level features, while the former provides low-level features for better localization performance (i.e. more accurate boundaries).

The residual unit is illustrated in Figure 1 and can generally be described as computing

$x_n = x_{n - 1} + \mathcal{F}(x_{n-1};\mathcal{W}_n)$

where $x_n$ is the output of layer $n$ and $\mathcal{W}_n$ represents the parameters in layer $n$, i.e. the layer is only responsible for representing a residual. The idea is to improve training by making the gradient partly independent of the depth, see the paper for details. A full resolution residual unit takes the form depicted in Figure 2 and computes
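The residual computation itself is simple; what the unit learns is only the correction added on top of the identity. A minimal sketch (toy 1D vectors; `residual_fn` is a hypothetical stand-in for the convolutional block $\mathcal{F}$):

```python
def residual_unit(x, residual_fn):
    """Compute x_n = x_{n-1} + F(x_{n-1}; W_n).

    residual_fn stands in for the learned convolutional block F;
    the identity shortcut lets gradients flow past the block unchanged.
    """
    return [xi + fi for xi, fi in zip(x, residual_fn(x))]


# toy residual function: F(x) = x^2, applied elementwise
out = residual_unit([1.0, 2.0], lambda v: [vi * vi for vi in v])
```

Because the identity term contributes a gradient of 1 regardless of what `residual_fn` computes, stacking many such units does not attenuate the gradient the way a plain chain of layers would.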

$z_n = z_{n - 1} + \mathcal{H}(y_{n - 1},z_{n - 1};\mathcal{W}_n)$

$y_n = \mathcal{G}(y_{n - 1}, z_{n - 1}; \mathcal{W}_n)$

where $z_n$ is the output of the residual stream at layer $n$, $y_n$ the output of the pooling stream at layer $n$, and $\mathcal{W}_n$ the weights of layer $n$. The full-resolution residual unit is implemented as in Figure 3. In particular, the residual input $z_{n-1}$ is first pooled to reduce its size; then two convolutional layers follow, each including batch normalization and a rectified linear unit. Finally, the residual output is up-scaled using an unpooling layer.
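The data flow of such a unit can be sketched as follows. This is a heavily simplified 1D toy version: max pooling and unpooling-by-repetition replace the real pooling/unpooling layers, and a single elementwise ReLU op stands in for the two conv+BN+ReLU layers computing $\mathcal{G}$; only the stream wiring matches the description above.

```python
def frru(y_prev, z_prev, stride=2):
    """Sketch of a full-resolution residual unit (1D toy version).

    y_prev: pooling-stream input (low resolution)
    z_prev: residual-stream input (full resolution)
    """
    # pool the residual stream down to the pooling stream's resolution
    z_pooled = [max(z_prev[i:i + stride]) for i in range(0, len(z_prev), stride)]
    # stand-in for the conv/BN/ReLU block computing G(y_{n-1}, z_{n-1}; W_n)
    y_next = [max(y + z, 0.0) for y, z in zip(y_prev, z_pooled)]
    # unpool by repetition and add to the residual stream:
    # z_n = z_{n-1} + H(y_{n-1}, z_{n-1}; W_n)
    z_next = [z + y_next[i // stride] for i, z in enumerate(z_prev)]
    return y_next, z_next
```

Note how the residual stream keeps its full resolution throughout: the unit only ever adds an up-scaled correction to it, while the pooled representation carries the high-level features.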

The overall architectures (two were evaluated, called FRRN A and FRRN B, respectively) are summarized in Figure 4. Training is done using Adam and the bootstrapped cross entropy loss [4]; see the paper for details.
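The core idea of the bootstrapped cross entropy loss is to average only over the hardest pixels instead of all pixels. A minimal sketch, assuming the per-pixel cross-entropy values have already been computed and `k` is the chosen number of hard pixels:

```python
def bootstrapped_ce(pixel_losses, k):
    """Bootstrapped cross entropy (sketch).

    Average only the k largest per-pixel cross-entropy losses, so
    easy, already-correct pixels do not dominate the gradient.
    """
    hardest = sorted(pixel_losses)[-k:]
    return sum(hardest) / k
```

With `k` equal to the total number of pixels this reduces to the ordinary mean cross entropy; smaller `k` focuses training on rare classes and difficult regions.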

The presented results look promising, especially as no pre-training is necessary. They achieve state-of-the-art performance even when training FRRN B on half-resolution images only and up-scaling the results using bilinear interpolation. Qualitative results are shown in Figure 5. Also check out Tobias Pohlen's webpage for details and source code.
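The bilinear up-scaling step used to bring the half-resolution predictions back to full resolution can be sketched in pure Python (in practice one would use a library routine; this is just the interpolation idea, applied separately along each axis):

```python
def lerp_1d(vals, factor):
    """Linearly interpolate a 1D sequence up to len(vals) * factor samples."""
    n, m = len(vals), len(vals) * factor
    out = []
    for k in range(m):
        t = k * (n - 1) / (m - 1)     # fractional source position
        i = int(t)
        frac = t - i
        j = min(i + 1, n - 1)
        out.append(vals[i] * (1 - frac) + vals[j] * frac)
    return out


def bilinear_upscale(img, factor=2):
    """Bilinear up-scaling of a 2D grid (list of rows), sketch."""
    rows = [lerp_1d(row, factor) for row in img]          # interpolate along x
    cols = [lerp_1d(col, factor) for col in zip(*rows)]   # interpolate along y
    return [list(r) for r in zip(*cols)]                  # transpose back
```

For score maps this is applied per class before taking the argmax, which is why the boundaries stay reasonably sharp despite predicting at half resolution.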

Figure 5: Qualitative results.

• [1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR, 2016.
• [2] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
• [3] H. Noh, S. Hong, and B. Han. Learning Deconvolution Network for Semantic Segmentation. In ICCV, 2015.
• [4] Z. Wu, C. Shen, and A. van den Hengel. Bridging Category-level and Instance-level Semantic Image Segmentation. arXiv:1605.06885, 2016.