Meet me at CVPR'18: Tuesday, June 19th, I will be presenting our work on weakly-supervised 3D shape completion.


Fisher Yu, Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. CoRR abs/1511.07122 (2015).

Yu and Koltun use dilated convolutions, packaged in a so-called context module, to integrate multi-scale context into convolutional networks for semantic segmentation. Compared to regular, discrete convolution

$(F\star k)(p) = \sum_{s+t=p} F(s)k(t),$

where $F$ and $k$ are discrete functions and $k$ is defined on a grid of size $(2r + 1)^2$, dilated convolution introduces a dilation factor $l$:

$(F\star_l k)(p) = \sum_{s+lt=p} F(s)k(t)$
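As a sanity check, the definition above can be translated directly into NumPy. This is a naive sketch for illustration only, not how dilated convolutions are implemented in practice; the function name and the restriction to "valid" output positions are my own choices:

```python
import numpy as np

def dilated_conv2d(F, k, l=1):
    """Naive dilated convolution: (F *_l k)(p) = sum over s + l*t = p of F(s) k(t).

    F: 2D input array; k: 2D kernel of odd size (2r+1, 2r+1); l: dilation factor.
    Only "valid" output positions are computed (a simplifying assumption).
    """
    r = k.shape[0] // 2
    H, W = F.shape
    eff = 2 * r * l + 1  # effective kernel extent with dilation l
    out = np.zeros((H - eff + 1, W - eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = 0.0
            for u in range(-r, r + 1):
                for v in range(-r, r + 1):
                    # s = p - l*t, with p centered at (i + r*l, j + r*l)
                    acc += F[i + r * l - l * u, j + r * l - l * v] * k[r + u, r + v]
            out[i, j] = acc
    return out
```

With $l = 1$ this reduces to ordinary convolution; with $l = 2$ the same $3 \times 3$ kernel samples 9 pixels spread over a $5 \times 5$ window.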

In the proposed context module, Yu and Koltun stack several layers of dilated convolution with exponentially increasing dilation factors:

$F_{i+1} = F_i \star_{2^i} k_i$ for $i = 0,1,\ldots,n-2$

This lets the receptive field grow exponentially with depth while preserving spatial resolution, in contrast to pooling or strided convolution, which enlarge the receptive field only by reducing resolution. The idea is illustrated in Figure 1.

Figure 1: Illustration of the first few layers of a context module with exponentially increasing dilation factor. Red points illustrate the sampled pixels while the green regions illustrate the increasing receptive field.
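The exponential growth is easy to verify by hand: a $3 \times 3$ kernel with dilation $d$ extends the receptive field by $2d$ pixels per side, so stacking layers with dilations $1, 2, 4, \ldots$ yields receptive fields of $3, 7, 15, \ldots$, i.e. $(2^{i+2} - 1)$ after layer $i$. A small sketch (the helper name is my own):

```python
def receptive_field(n_layers):
    """One-sided receptive field after stacking n_layers 3x3 convolutions
    with dilation 2^i at layer i (i = 0, ..., n_layers - 1)."""
    rf = 1
    for i in range(n_layers):
        d = 2 ** i
        rf += 2 * d  # a 3x3 kernel with dilation d adds d pixels on each side
    return rf
```

For example, `receptive_field(3)` gives 15, matching $2^{3+1} - 1$; a comparable stack of undilated $3 \times 3$ convolutions would only reach $2 \cdot 3 + 1 = 7$.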

Yu and Koltun also discuss initialization. Concretely, with standard random initialization schemes they were not able to improve semantic segmentation performance using the context module. Instead, they use an identity initialization:

$k^b(t, a) = 1_{[t=0]}1_{[a=b]}$

where $a$ is the index of the input feature map and $b$ the index of the output feature map (assuming the same number of input and output feature maps); $t$ indexes the kernel location, i.e. only the central weight is set to one. They also generalize this scheme to the case where the numbers of input and output feature maps differ. Let $c_i$ and $c_{i+1}$ be the number of feature maps in layer $i$ and layer $i + 1$; then

$k^b(t,a) = \begin{cases}\frac{C}{c_{i+1}} & t = 0 \text{ and } \left\lfloor\frac{aC}{c_i}\right\rfloor = \left\lfloor\frac{bC}{c_{i+1}}\right\rfloor\\ \epsilon & \text{otherwise}\end{cases}$

with $\epsilon$ being a zero-mean normal random variable with standard deviation $\sigma \ll C/c_{i + 1}$.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me: