# DAVIDSTUTZ


## Reviews and Rebuttal for “Superpixels: An Evaluation of the State-of-the-Art”

This article summarizes the reviews corresponding to our paper “Superpixels: An Evaluation of the State-of-the-Art”. The paper was accepted for publication in Computer Vision and Image Understanding. The reviews correspond to v2 on ArXiv. The updated version will be made available on ArXiv.

The following paragraphs discuss questions raised by the reviewers; the questions are discussed in no particular order. At this point, I want to thank both reviewers for the constructive feedback and Alexander Hermans for his help regarding the rebuttal. All changes discussed below will be integrated into the final version.

The paper (both the current and future versions) is available on ArXiv:

It was pointed out that the names Average Recall (ARec) and Average Explained Variation (AEV) are misleading.

We decided to change the names of these metrics. Since we intend to display all three proposed metrics (ARec, AEV and Average Undersegmentation Error (AUE)) side-by-side, we wanted to formulate these metrics in a "lower is better" fashion with lower bound $0$. This way, the experimental results are easy to communicate and no normalization is needed. Overall, we renamed ARec to Average Miss Rate (AMR) and AEV to Average Unexplained Variation (AUV). The reasoning is that $\text{MR} = (1 - \text{Rec})$ describes the Boundary Miss Rate (MR), and $\text{UV} = (1 - \text{EV})$ describes the Unexplained Variation (UV).

Several aspects of parameter optimization are unclear, including the following:

• We do not mention for which values of $K$ (= number of superpixels) parameters were optimized;
• The choice of the objective, i.e. $(1 - \text{Rec}) + \text{UE}$, is not justified (here Rec denotes Boundary Recall and UE refers to Undersegmentation Error);
• It is unclear whether compactness parameters (if applicable) are included in parameter optimization.

Parameter optimization was performed on the training set of each dataset for $K \in \{400, 1200, 3600\}$; the obtained parameters were interpolated linearly in between. However, we found that the parameters for most algorithms are consistent across different values of $K$ (the detailed parameters are also available online). Regarding the objective, we chose an additive formulation to avoid algorithms concentrating on only one metric (usually Rec) while neglecting the other (usually UE). For example, considering a multiplicative formulation (i.e. $(1 - \text{Rec})\cdot \text{UE}$), we found that some algorithms are capable of driving $(1 - \text{Rec})$ towards zero, while $\text{UE}$ is bounded below by a fixed constant on all datasets; the product then becomes small without UE improving. In an additive formulation, this is less problematic, as optimizing purely for Rec gives diminishing returns, forcing algorithms to consider UE, too. On a different note, other objectives used in image segmentation, for example the F-score (the harmonic mean of Boundary Recall, Rec, and Boundary Precision, Pre), are not meaningful as superpixels usually exhibit very low Pre. Finally, optimization was performed over all parameters; this includes compactness parameters (if applicable), the number of iterations (if applicable) and the color space.
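To illustrate the difference between the two objectives and the linear interpolation between the optimized values of $K$, here is a minimal Python sketch. The candidate names, the Rec/UE numbers and the compactness values are hypothetical and not taken from the paper; clamping outside the range $[400, 3600]$ is likewise an assumption made only for this example.

```python
def additive_objective(rec, ue):
    """Objective discussed above: (1 - Rec) + UE, lower is better."""
    return (1.0 - rec) + ue


def multiplicative_objective(rec, ue):
    """Alternative formulation: (1 - Rec) * UE, lower is better."""
    return (1.0 - rec) * ue


# Hypothetical results of two parameter settings for one algorithm.
candidates = {
    "high_recall": {"rec": 0.99, "ue": 0.30},  # near-perfect Rec, poor UE
    "balanced": {"rec": 0.90, "ue": 0.10},     # slightly lower Rec, good UE
}

best_additive = min(candidates, key=lambda n: additive_objective(**candidates[n]))
best_multiplicative = min(
    candidates, key=lambda n: multiplicative_objective(**candidates[n])
)


def interpolate_parameter(k, ks, values):
    """Piecewise-linear interpolation of a parameter between optimized K.

    Values outside the optimized range are clamped (an assumption).
    """
    if k <= ks[0]:
        return values[0]
    if k >= ks[-1]:
        return values[-1]
    for (k0, v0), (k1, v1) in zip(zip(ks, values), zip(ks[1:], values[1:])):
        if k0 <= k <= k1:
            t = (k - k0) / (k1 - k0)
            return v0 + t * (v1 - v0)


optimized_k = [400, 1200, 3600]
optimized_compactness = [10.0, 20.0, 40.0]  # hypothetical optimized values

print(best_additive, best_multiplicative)
print(interpolate_parameter(800, optimized_k, optimized_compactness))
```

With these made-up numbers, the multiplicative objective prefers the setting that pushes Rec close to one despite its poor UE, whereas the additive objective prefers the balanced setting.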

We do not explicitly discuss the generation of the ranking table (Table 3).

The ranking table presents Average Recall (ARec, renamed to AMR as discussed above) and Average Undersegmentation Error (AUE) averaged over all datasets for each algorithm (columns "ARec" and "AUE"). Then, for each dataset, algorithms were ranked according to $\text{ARec} + \text{AUE}$, where the lowest value corresponds to the best rank, i.e. rank one. The average rank (column "Rank") is reported and the remaining columns show how often an algorithm obtained a specific rank across all datasets.
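The procedure can be sketched in a few lines of Python. This is only an illustration of the description above, not the script used for the paper: the dataset names, algorithm names and metric values are hypothetical, and breaking ties by algorithm name is an assumption.

```python
from collections import Counter

# Hypothetical (ARec, AUE) values per dataset and algorithm; both metrics
# are formulated as "lower is better".
results = {
    "dataset_a": {"alg_x": (0.10, 0.05), "alg_y": (0.12, 0.04), "alg_z": (0.20, 0.10)},
    "dataset_b": {"alg_x": (0.15, 0.08), "alg_y": (0.11, 0.06), "alg_z": (0.25, 0.12)},
}


def ranking_table(results):
    """Build one table row per algorithm from per-dataset (ARec, AUE)."""
    algorithms = sorted(next(iter(results.values())))
    ranks = {a: [] for a in algorithms}

    # Rank algorithms on each dataset by ARec + AUE; rank one is best.
    # sorted() is stable, so ties fall back to alphabetical order.
    for scores in results.values():
        ordered = sorted(algorithms, key=lambda a: sum(scores[a]))
        for rank, a in enumerate(ordered, start=1):
            ranks[a].append(rank)

    n = len(results)
    table = {}
    for a in algorithms:
        table[a] = {
            "ARec": sum(results[d][a][0] for d in results) / n,
            "AUE": sum(results[d][a][1] for d in results) / n,
            "Rank": sum(ranks[a]) / n,          # average rank over datasets
            "rank_counts": Counter(ranks[a]),   # how often each rank occurred
        }
    return table


table = ranking_table(results)
for name, row in table.items():
    print(name, row)
```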

The categorization of algorithms in Section 3 is not motivated or justified in detail.

The categorization is based on [], which is, to the best of our knowledge, the only categorization of superpixel algorithms. The categorization was extended to fit the additional evaluated algorithms by describing the high-level approaches. The motivation of the categorization is to provide the reader with a rough understanding of the algorithms while being detailed enough to discuss experimental results (i.e. relating experimental results to individual categories).

We claim that the choice of datasets is representative for many vision tasks. However, we do not provide references or discuss limitations.

With the chosen datasets we focus on natural images and believe that the presented experimental results are relevant for applications (e.g. [][][][][][][]) and might also be useful for similar but larger datasets (such as PASCAL VOC [], ImageNet [] or Microsoft COCO []). However, we also acknowledge that the chosen datasets are not representative for specialized research areas such as medical imaging.

In Section 7.4, we did not discuss the evaluated implementations in detail.

We note that reSEEDS and reFH are our own implementations, closely following the corresponding publications. Both are implemented in C/C++, like their original counterparts, and were optimized regarding runtime and connectivity. The other evaluated implementations, vlSLIC and preSLIC, were not written by us: vlSLIC is part of the VLFeat library and preSLIC is based on the original implementation. However, a detailed discussion is beyond the scope of Section 7.4; therefore, we refer to the implementations available on GitHub.

We overlooked a corrected formulation of Undersegmentation Error discussed in [].

In [], the corrected Undersegmentation Error is defined as follows:

$\text{CUE}(G, S) = \frac{1}{N}\sum_{S_j} |S_j - \arg \max_{G_i}|S_j \cap G_i||$

where $\{S_j\}_{j = 1}^K$ denotes a superpixel segmentation, $\{G_i\}$ a ground truth segmentation (with the $S_j$ and the $G_i$ each being disjoint sets of pixels) and $N$ the number of pixels. First of all, a short calculation shows that

$\text{CUE}(G, S) = (1 - \text{ASA}(G,S))$

with ASA being Achievable Segmentation Accuracy as defined in Section 5.
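The identity follows because $|S_j - G_i| = |S_j| - |S_j \cap G_i|$ and the superpixel sizes $|S_j|$ sum to $N$, assuming the usual definition $\text{ASA}(G, S) = \frac{1}{N}\sum_{S_j} \max_{G_i} |S_j \cap G_i|$. The following Python sketch checks this numerically on a small, hypothetical example; the flattened label lists and the helper names are made up for illustration and are not part of our benchmark code.

```python
from collections import Counter


def overlap_counts(gt, sp):
    """For each superpixel label, count the pixels per ground truth label."""
    counts = {}
    for g, s in zip(gt, sp):
        counts.setdefault(s, Counter())[g] += 1
    return counts


def asa(gt, sp):
    """Achievable Segmentation Accuracy: best-overlap pixels divided by N."""
    counts = overlap_counts(gt, sp)
    return sum(max(c.values()) for c in counts.values()) / len(gt)


def cue(gt, sp):
    """Corrected Undersegmentation Error: pixels of each superpixel that lie
    outside its best-overlapping ground truth segment, divided by N."""
    counts = overlap_counts(gt, sp)
    return sum(sum(c.values()) - max(c.values()) for c in counts.values()) / len(gt)


# Hypothetical flattened label maps with N = 12 pixels.
gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
sp = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3]

cue_value = cue(gt, sp)
asa_value = asa(gt, sp)
print(cue_value, asa_value)
```

In this example, three of the twelve pixels fall outside the best-overlapping ground truth segment of their superpixel, so CUE and ASA add up to one.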

We were pointed to [] where it is reported that depth information helps to improve superpixel performance. This is in contrast to the experiments presented in our paper.

Table 3 in [] reports increased performance for a SLIC variant that uses depth information. However, the presented experiments are not comparable to ours. First, different algorithms are evaluated, and these algorithms use different features: for example, the SLIC variant in [] uses the depth gradient as a feature, while DASP and VCCS (evaluated in our paper) use surface normals and 3D coordinates. Second, different evaluation metrics were used. Finally, the experiments in [] are constrained to $K \approx 200$ superpixels. Overall, the effect of depth information might be more pronounced in the experiments in [] compared to ours.

There are two implementations of SEEDS available: one as part of OpenCV and the original by the authors. It may be unclear why we state that SEEDS does not provide a compactness parameter and why the presented runtime is considerably higher than the runtime of the OpenCV implementation.

First, SEEDS, as described in [], provides both a compactness parameter and a so-called smoothness prior. After revisiting both publicly available implementations, we found that neither provides the compactness parameter. The smoothness prior, in turn, only considers local smoothness of the superpixel boundaries. In our paper, we consider compactness and smoothness separately and, thus, do not consider the smoothness prior as a compactness term. Second, we used the original implementation of SEEDS, which has a higher runtime than the OpenCV implementation. This is also a good example of the discussion in Section 7.4, where we show that revisiting implementations may lead to reduced runtime.

• [] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. PAMI, 2012.
• [] S. He, R. W. H. Lau, W. Liu, Z. Huang, Q. Yang. SuperCNN: A superpixelwise convolutional neural network for salient object detection. IJCV, 2015.
• [] P. A. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marqués, J. Malik. Multiscale combinatorial grouping. CVPR, 2014.
• [] M. Liu, M. Salzmann, X. He. Discrete-continuous depth estimation from a single image. CVPR, 2014.
• [] C. D. C. Lerma, J. Košecká. Semantic segmentation with heterogeneous sensor coverages. ICRA, 2014.
• [] D. Lin, S. Fidler, R. Urtasun. Holistic scene understanding for 3d object detection with RGBD cameras. ICCV, 2013.
• [] S. Gupta, P. A. Arbeláez, R. B. Girshick, J. Malik. Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. IJCV, 2015.
• [] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, T. L. Berg. Parsing clothing in fashion photographs. CVPR, 2012.
• [] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
• [] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
• [] T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, 2014.
• [] M. Van den Bergh, X. Boix, G. Roig, L. J. Van Gool. SEEDS: Superpixels Extracted Via Energy-Driven Sampling. IJCV, 2015.
• [] J. Zhang, C. Kan, A. G. Schwing, R. Urtasun. Estimating the 3D Layout of Indoor Scenes and Its Clutter from Depth Sensors. ICCV, 2013.
