[1] M. Grundmann, V. Kwatra, M. Han, I. Essa. Efficient Hierarchical Graph Based Video Segmentation. Conference on Computer Vision and Pattern Recognition, 2010.
[2] P. Neubert, P. Protzel. Superpixel benchmark and comparison. Forum Bildverarbeitung, 2012.
Xu and Corso implement and evaluate several supervoxel algorithms, including the approach proposed by Grundmann et al. [1]. Evaluation is based on 3D variants of Undersegmentation Error, Achievable Segmentation Accuracy and Boundary Recall. Both the benchmark and the implementations are published as part of a library called libsvx, which is available at the project's webpage.
Note that the formulation of the Undersegmentation Error used there for a supervoxel segmentation $S = \{S_i\}$, $S_i \subseteq \{1,\ldots,H\} \times \{1,\ldots,W\} \times \{1,\ldots, T\} =: V$, with respect to a ground truth segmentation $G = \{G_j\}$, $G_j \subseteq V$, defined as
$UE(S,G) = \frac{1}{|G|} \sum_{G_j \in G} \frac{\left(\sum_{S_i \cap G_j \neq \emptyset} |S_i|\right) - |G_j|}{|G_j|}$
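is not constrained to lie in $[0,1]$: every supervoxel overlapping $G_j$ contributes its full size $|S_i|$, so the numerator can exceed $|G_j|$ many times over when large supervoxels leak out of small ground truth segments. Therefore, the results are hard to interpret or compare across datasets, or even across different video sequences. A generalization of the formulation given by Neubert and Protzel [2] to video sequences seems more appropriate.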
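To make the difference concrete, here is a minimal NumPy sketch of both measures. It is not taken from libsvx; the function names and the toy volume are made up for illustration. The first function follows the formula above term by term. The second is one possible generalization of the Neubert and Protzel formulation to videos, assuming the per-overlap penalty $\min(|S_i \cap G_j|, |S_i \setminus G_j|)$ and normalization by the number of voxels $|V|$, which keeps the result in $[0,1]$.

```python
import numpy as np


def _supervoxel_ids_and_sizes(sv):
    # Map arbitrary labels to 0..K-1 so that bincount yields supervoxel sizes.
    _, ids = np.unique(sv, return_inverse=True)
    return ids, np.bincount(ids)


def undersegmentation_error(sv, gt):
    """UE as in the formula above, averaged over ground truth segments."""
    sv = np.asarray(sv).ravel()
    gt = np.asarray(gt).ravel()
    ids, sizes = _supervoxel_ids_and_sizes(sv)
    errors = []
    for g in np.unique(gt):
        mask = gt == g
        # All supervoxels overlapping G_j are counted with their full size.
        covered = sizes[np.unique(ids[mask])].sum()
        size = mask.sum()
        errors.append((covered - size) / size)
    return float(np.mean(errors))


def undersegmentation_error_np(sv, gt):
    """Assumed video generalization of the Neubert and Protzel formulation."""
    sv = np.asarray(sv).ravel()
    gt = np.asarray(gt).ravel()
    ids, sizes = _supervoxel_ids_and_sizes(sv)
    total = 0
    for g in np.unique(gt):
        # Per supervoxel: voxels inside G_j and voxels outside G_j.
        inside = np.bincount(ids[gt == g], minlength=len(sizes))
        outside = sizes - inside
        # Only overlapping supervoxels count; each adds its smaller part.
        total += np.minimum(inside, outside)[inside > 0].sum()
    return float(total / sv.size)


# Toy video with T=2, H=2, W=5: one small ground truth segment (4 voxels)
# and one large one (16 voxels), covered by a single supervoxel.
gt = np.zeros((2, 2, 5), dtype=int)
gt[:, :, 0] = 1
sv = np.zeros((2, 2, 5), dtype=int)

print(undersegmentation_error(sv, gt))     # 2.125, i.e. outside [0, 1]
print(undersegmentation_error_np(sv, gt))  # 0.4
```

In the toy example, the small segment alone contributes $(20 - 4)/4 = 4$ to the first measure, which illustrates why the value depends strongly on the size of the ground truth segments. Both functions return 0 when the supervoxel segmentation equals the ground truth.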