Video Segmentation

Unsupervised video segmentation inherits many of its difficulties from image segmentation where it has been shown [1] that even humans have immense problems choosing the optimal number of segments and granularity of the segmentation. In contrast, video provides valuable temporal information to guide segmentation. Still, most state-of-the-art approaches generate oversegmentations where the video volume is segmented into many so called supervoxels [2]. Some authors [3] build hierarchies of segmentations by iteratively merging these supervoxels. However, choosing the optimal hierarchy level is difficult and no metrics are available to guide this decision.

This article is intended to give a brief overview over state-of-the-art video segmentation algorithms, evaluation benchmarks and datasets.

Video segmentation takes different forms and, therefore, highly varying approaches have been proposed. The earliest publications on video segmentation use mean-shift approaches [4], [5]. Later, Brendel a Todorovic [6] proposed to track regions over time. Lee et al. [7], Li et al. [8] as well as Papazoglou and Ferrari [9] present approaches for figure ground segmentation. Other approaches are motivated by superpixel algorithms such as [10], [11] or [12] and introduce generalizations to video [3], [11], [13]. These algorithms are commonly referred to as supervoxel algorithms, however, the terms superpixel and supervoxel are not clearly defined throughout the literature. Galasso et al. [14] extend the algorithm of [15] to include motion cues and a streaming version of this approach was introduced in [16]. Another streaming video segmentation algorithm, based on the approach by Grundmann et al. [3], was presented by Xu et al. [17].

Datasets

Proper datasets are essential for evaluating and comparing video segmentation algorithms. In the following, the most important datasets used for (semantic) video segmentation are presented. A short overview over additional datasets can also be found in [18].

Grundmann et al. [3] provide a set of 15 sequences of varying length and scenes, see here. However, no ground truth annotation is provided such that the dataset can only be used for qualitative evaluation and most of the measures described later are not applicable.
Chen and Corso [19] provide a semantically labeled dataset with 8 sequences, see here. The sequences are annotated according to 24 classes such as "building", "grass" or "body". Example sequences are shown in figure 12. Further, Chen and Corso [CC11] provide another dataset, see here of 380 frames split up into 10 sequences and annotated by 4 classes ("vehicles", "road", "obstacles" and "others").
In [20], Xu and Corso use the SegTrack dataset by Tsai et al. [21], consisting of 6 sequences to evaluate supervoxel algorithms, see here. However, the dataset merely provides figure-ground annotations.
The Sintel dataset published by Butler et al. [22] provides 23 synthetic sequences taken from the open movie "Sintel". Ground truth annotation is derived from material properties of the sequences. As result, the ground truth cannot be considered semantic, however, still includes non-connected segments.
Liu et al. [23] provide the Wild8 dataset (which is currently not publicly available, however, can be obtained from the authors), consisting of 100 sequences of which 33 are semantically labeled. The sequences are taken from documentaries and segmented into 8 classes such as "water", "sky", "bird" or "lion".
Galasso et al. [18] provide the VSB100 dataset, see here, of short sequences as part of their video segmentation benchmark. They provide 40 training sequences and 60 test sequences with 11-15 frames per training sequence and 3-8 frames per test sequence. The sequences are taken from the Berkeley Video Dataset by Sundberg et al. [24].

An important drawback of most of the presented datasets is their semantic annotation as segments are not required to be connected. This may influence evaluation using some of metrics discussed later. For converting the annotations a three-dimensional connected components algorithm could be used.

Figure 1 (click to enlarge): Examples of video sequences from the presented datasets. From top to bottom: Grundmann et al. [3]; Chen and Corso [19]; Chen and Corso ground truth; Tsai et al. [21]; Tsai et al. ground truth; Sintel [22]; Sintel ground truth; Wild8 [23]; Wild8 ground truth; VSB100 [18] and VSB100 ground truth.

Benchmarks

To the best of my knowledge, two publications comparing state-of-the-art video segmentation algorithms are available: Xu and Corso [20] as well as Galasso et al. [18]. In both, the authors propose to generalize common metrics used for image segmentation:

Xu and Corso [20] - as well as Galasso et al. [18] - propose to generalize Boundary Recall to video volumes. Given a ground truth segmentation $G = \{G_i\}$ and a segmentation $S = \{S_j\}$, both partitions of the set of all pixels, 3D Boundary Recall is defined as
$3DRec(G, S) = \frac{TP(G, S)}{TP(G, S) + FN(G, S)}$
where $TP(G, S)$ is the number of true positive boundary pixels and $FN(G, S)$ the number of false positive boundary pixels. Thus, 3D Boundary Recall measures the fraction of correctly detected boundary pixels. Boundary pixels are identified both spatially and temporally, however, Xu and Corso [20] as well as Galasso et al. [18] do not distinguish between spatial and temporal boundary pixels. As consequence, temporal boundary pixels in $S$ may be matched to temporal boundary pixels in $G$ and vice versa.
Furthermore, Xu and Corso [20] propose the following formulation of the 3D Undersegmentation Error:
$3DUE_{XC}(G, S) = \frac{1}{|G|} \sum_{G_i \in G} \frac{\left(\sum_{S_j \cap G_i \neq \emptyset} |S_j|\right) - |G_i|}{|G_i|}$.
As this formulation of the 3D Undersegmentation Error is not constrained to lie in $[0,1]$, comparison across datasets is difficult. Therefore, generalization of the formulation by Neubert and Protzel [25] may be advantageous:
$3DUE_{NP}(G, S) = \frac{1}{\sum_{G_i \in G} |G_i|} \sum_{G_i \in G} \sum_{S_j \cap G_i \neq \emptyset} \min\{|S_j \cap G_i|, |S_j - G_i|\}$.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis Machine Intelligence, 33(5):898–916, May 2011.
[2] C. Xu, J. J. Corso. Evaluation of super-voxel methods for early video processing. Conference on Computer Vision and Pattern Recognition, pages 1202– 1209, Providence, RI, June 2012.
[3] M. Grundmann, V. Kwatra, M. Han, I. A. Essa. Efficient hierarchical graph-based video segmentation. Conference on Computer Vision and Pattern Recognition, pages 2141–2148, San Francisco, June 2010.
[4] J. Wang, B. Thiesson, Y. Xu, M. F. Cohen. Image and video segmentation by anisotropic kernel mean shift. European Conference on Computer Vision, volume 3022 of Lecture Notes in Computer Science, pages 238–249, Prague, Czech Republic, May 2004. Springer.
[5] D. Freedman, P. Kisilev. Fast mean shift by compact density representation. Conference on Computer Vision and Pattern Recognition, pages 1818–1825, Miami, FL, June 2009.
[6] W. Brendel and S. Todorovic. Video object segmentation by tracking regions. International Conference on Computer Vision, pages 833–840, Kyoto, Japan, September 2009.
[7] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. International Conference Computer Vision, pages 1995–2002, Barcelona, November 2011.
[8] F. Li, T. Kim, A. Humayun, D. Tsai, J. M. Rehg. Video segmentation by tracking many figure-ground segments. International Conference on Computer Vision, pages 2192–2199, Sydney, Australia, December 2013.
[9] A. Papazoglou an V. Ferrari. Fast object segmentation in unconstrained video. International Conference on Computer Vision, pages 1777–1784, Sydney, Australia, December 2013.
[10] P. F. Felzenszwalb, D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167-181, September 2004.
[11] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis Machine Intelligence, 34(11):2274–2282, 2012.
[12] M. Van den Bergh, X. Boix, G. Roig, and L. J. Van Gool. SEEDS: superpixels extracted via energy-driven sampling. Computing Research Repository, abs/1309.3848, 2013.
[13] M. Van den Bergh, G. Roig, X. Boix, S. Manen, L. J. Van Gool. Online video SEEDS for temporal window objectness. International Conference on Computer Vision, pages 377–384, Sydney, Australia, December 2013.
[14] F. Galasso, R. Cipolla, B. Schiele. Video segmentation with superpixels. Asian Conference on Computer Vision, volume 7724 of Lecture Notes in Computer Science, pages 760–774, Daejeon, Korea, November 2012. Springer.
[15] P. Arbelaez, M. Maire, C. C. Fowlkes, J. Malik. From contours to regions: An empirical evaluation. Conference on Computer Vision and Pattern Recognition, pages 2294–2301, Miami, FL, June 2009.
[16] F. Galasso, M. Keuper, T. Brox, B. Schiele. Spectral graph reduction for efficient image and streaming video segmentation. Conference on Computer Vision and Pattern Recognition, pages 49–56, Columbus, OH, 2014.
[17] C. Xu, C. Xiong, J. J. Corso. Streaming hierarchical video segmentation. European Conference on Computer Vision, pages 626–639, October 2012.
[18] F. Galasso, N. S. Nagaraja, T. J. Cardenas, T. Brox, B. Schiele. A unified video segmentation benchmark: Aannotation, metrics and analysis. International Conference on Computer Vision, pages 3527–3534, Sydney, Australia, December 2013.
[19] A. Y. C. Chen, J. J. Corso. Propagating multi-class pixel labels throughout video frames. In Image Processing Workshop, pages 14–17, November 2010.
[20] C. Xu and J. J. Corso. Evaluation of super-voxel methods for early video processing. In Computer Vision and Pattern Recognition, Conference on, pages 1202– 1209, Providence, RI, June 2012.
[21] D. Tsai, M. Flagg, J. M. Rehg. Motion coherent tracking with multi-label MRF optimization. In British Machine Vision Conference, pages 1–11, Aberystwyth, UK, September 2010.
[22] D. J. Butler, J. Wulff, G. B. Stanley, M. J. Black. A naturalistic open source movie for optical flow evaluation. In Computer Vision, Conference on, volume 7577 of Lecture Notes in Computer Science, pages 611–625. Springer, October 2012.
[23] X. Liu, D. Tao, M. Song, Y. Ruan, C. Chen, J. Bu. Weakly supervised multiclass video segmentation. In Computer Vision and Pattern Recognition, Conference on, pages 57–64, Columbus, OH, June 2014.
[24] P. Sundberg, T. Brox, M. Maire, P. Arbelaez, J. Malik. Occlusion boundary detection and figure/ground assignment from optical flow. In Computer Vision and Pattern Recognition, Conference on, pages 2233–2240, Colorado Springs, CO, June 2011.
[25] P. Neubert, P. Protzel. Superpixel benchmark and comparison. In Forum Bildverarbeitung, Regensburg, Germany, November 2012.

IAM

DAVIDSTUTZ

ARTICLE