

Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. CoRR; 2013.

Lin et al. propose the Network in Network (often referred to as NiN) structure, which is intended to replace the linear filters of convolutional layers with local multilayer perceptrons. The motivation is that, seen locally, a convolution is an inner product and as such can only classify linearly separable inputs. The recently proposed maxout networks [] go one step further and compute the maximum over several feature maps, which allows them to separate convex sets. Lin et al. instead apply small multilayer perceptrons to the receptive fields. Because multilayer perceptrons are universal approximators, this further increases expressiveness. Concretely, network in network computes

$f_{i,j,k_1}^1 = \max(\left(w_{k_1}^1\right)^Tx_{i,j} + b_{k_1}, 0)$

$\vdots$

$f_{i,j,k_n}^n = \max(\left(w_{k_n}^n\right)^T f_{i,j}^{n - 1} + b_{k_n}, 0)$

where $x_{i,j}$ refers to the input patch centered at location $(i,j)$ and $k_n$ indexes the feature maps of layer $n$. So for each patch, a multilayer perceptron with $n$ layers is applied, where each layer includes biases and uses the ReLU activation function. In contrast, the maxout network computes

$f_{i,j,k} = \max_m(w_{k_m}^T x_{i,j})$

at each location. This is a regular convolutional layer without activation function whose $m$ feature maps per output $k$ are max pooled across the feature dimension instead of spatially.
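
The two per-patch computations above can be contrasted in a minimal sketch. It uses plain Python lists and only handles a single flattened patch $x_{i,j}$; the function names (`mlpconv_patch`, `maxout_patch`) and all weights are made up for illustration and are not taken from the paper or any library. Sliding the same function over all locations $(i,j)$ would give the full feature maps.

```python
def affine(weights, bias, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Element-wise max(v, 0)."""
    return [max(v, 0.0) for v in values]


def mlpconv_patch(x, layers):
    """Apply an n-layer ReLU perceptron to one flattened patch x.

    layers is a list of (weights, bias) pairs, one pair per layer.
    """
    f = x
    for weights, bias in layers:
        f = relu(affine(weights, bias, f))
    return f


def maxout_patch(x, pieces):
    """Maxout on one flattened patch x.

    pieces[k] holds the m weight vectors of output feature k;
    the output is the maximum of the m inner products.
    """
    return [max(sum(w * xi for w, xi in zip(weight, x)) for weight in piece)
            for piece in pieces]


# Hypothetical toy patch and weights (illustrative values only).
x = [1.0, -2.0]
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 2.0]),  # 2 -> 3 features
    ([[1.0, 1.0, 1.0], [1.0, -1.0, -1.0]], [0.0, -1.0]),      # 3 -> 2 features
]
pieces = [
    [[1.0, 0.0], [0.0, 1.0]],    # m = 2 linear pieces for feature 0
    [[-1.0, 0.0], [0.0, -1.0]],  # m = 2 linear pieces for feature 1
]

print(mlpconv_patch(x, layers))
print(maxout_patch(x, pieces))
```

Every layer after the first only sees the feature vector at the same location, so in a convolutional framework these layers would typically be realized as $1 \times 1$ convolutions.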

Furthermore, they propose global average pooling. The idea is to get rid of the final fully connected layers used in most convolutional neural networks. To this end, the fully connected layers are replaced by a global average pooling layer which computes the average of each feature map. The last convolutional layer (or network in network layer in their case) computes as many feature maps as there are classes. The averaged values, one per class, are then fed into a softmax layer for classification.
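
As a rough illustration of this classification head, the following sketch averages each feature map to a single score and applies a softmax. The helper names and the toy $2 \times 2$ feature maps are assumptions chosen for the example, not values from the paper.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each 2D feature map (a list of rows) to its mean value."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last layer: one 2x2 feature map per class.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],  # class 0
    [[0.0, 0.0], [0.0, 0.0]],  # class 1
    [[2.0, 2.0], [2.0, 2.0]],  # class 2
]

pooled = global_average_pool(feature_maps)  # one score per class
probabilities = softmax(pooled)
predicted_class = probabilities.index(max(probabilities))
print(pooled, predicted_class)
```

Note that the pooling step itself has no parameters, in contrast to the fully connected layers it replaces.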

Lin et al. demonstrate the effectiveness of their model on several datasets. On CIFAR-10 and CIFAR-100 [] they are able to improve over the state of the art. As a prototype network, they use three network in network layers, which they also call mlpconv layers, each followed by a $2 \times 2$ max pooling layer. Dropout is used after the first two mlpconv layers. For classification, global average pooling is used.

  • [] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. CoRR, 2013.
  • [] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
