Efficient Hierarchical Graph-Based Video Segmentation: An Implementation

Introduction

After discussing state-of-the-art video segmentation algorithms as well as used datasets and benchmarks, this article is intended to present an implementation of the hierarchical video segmentation algorithms poposed by Grundmann et al. [1]. The implementation is available on GitHub:

GitHub

Hierarchical Graph-Based Video Segmentation

The approach is based on the algorithm by Felzenswalb and Huttenlocher [2] which is frequently used for both image segmentation and superpixel segmentation. Details and examples can be found in the following box (click to expand):

P. F. Felzenswalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, volume 59, number 2, 2004.

After introducing SEEDS [1] and SLIC [2], this article focusses on another approach to superpixel segmentation proposed by Felzenswalb & Huttenlocher. The procedure is summarized in algorithm 1 and based on the following definitions. Given an image $I$, $G = (V, E)$ is the graph with nodes corresponding to pixels, and edges between adjacent pixels. Each edge $(n,m) \in E$ is assigned a weight $w_{n,m} = \|I(n) - I(m)\|_2$. Then, for subsets $A,B \subseteq V$ we define $MST(A)$ to be the minimum spanning tree of $A$ and

$Int(A) = max_{n,m \in MST(A)} \{w_{n,m}\}$,

$MInt(A,B) = min\{Int(A) + \frac{\tau}{|A|}, Int(B) + \frac{\tau}{|B|}\}$.

where $\tau$ is a threshold parameter and $MInt(A,B)$ is called the minimum internal difference between components $A$ and $B$. Starting from an initial superpixel segmentation where each pixel forms its own superpixel the algorithm processes all edges in order of increasing edge weight. Whenever an edge connects two different superpixels, these are merged if the edge weight is small compared to the minimum internal difference.

function fh(
        $G = (V,E)$ // Undirected, weighted graph.
    )
    sort $E$ by increasing edge weight
    let $S$ be the initial superpixel segmentation
    for $k = 1,\ldots,|E|$
        let $(n,m)$ be the $k^{\text{th}}$ edge
            if the edge connects different superpixels $S_i,S_j \in S$
                if $w_{n,m}$ is sufficiently small compared to $MInt(S_i,S_j)$
                    merge superpixels $S_i$ and $S_j$
    return $S$

Algorithm 1: The superpixel algorithm proposed by Felzenswalb & Huttenlocher.

The original implementation can be found on Felzenswalb's website. Generated superpixel segmentations can be found in figure 1.

Figure 1: Images from the Berkeley Segmentation Dataset [3] oversegmented using the algorithm proposed by Felzenswalb & Huttenlocher.

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk. SLIC superpixels. Technical report, École Polytechnique Fédérale de Lausanne, 2010.
[2] M. van den Bergh, X. Boix, G. Roig, B. de Capitani, L. van Gool. SEEDS: Superpixels extracted via energy-driven sampling. European Conference on Computer Vision, pages 13–26, 2012.
[3] P. Arbeláez, M. Maire, C. Fowlkes, J. Malik. Contour detection and hierarchical image segmentation. Transactions on Pattern Analysis and Machine Intelligence, volume 33, number 5, pages 898–916, 2011.

Interpreting a video as 6-connected, weighted graph where the weights are originally defined as $L_1$ or $L_2$ distance in color space, Felzenswalb and Huttenlocher [2] argue that edges within segments should have low weights compared to edges between segments. This intuition results in the following definition of the internal difference of a segment $S_j$ (here, $S = \{S_j\}$ is a partitioning of all pixels in the video):

$Int(S_j) = \max_{(n.m) \in MST(S_j)}\{w_{n,m}\}$

where $w_{n,m}$ is the weight assigned to pixel pair $(x_n, x_m)$ and $MST(S_j)$ denotes the minimum spanning tree of segment $S_j$ (which is required to represent a connected set of pixels). Further, Felzenswalb and Huttenlocher [2] define the minimum internal difference between segments $S_j$ and $S_k$ as

$MInt(S_j, S_k) = \min\left\{Int(S_j) + \frac{\tau}{|S_j|}, Int(S_k) + \frac{\tau}{|S_k|}\right\}$.

where $\tau$ is a threshold parameter. An oversegmentation is then generated by processing all edges in order of increasing edge weight and merging the corresponding segments whenever the edge weight is smaller than the minimum internal difference between the two segments. Afterwards, both [2] and [1] enforce a minimum segment size.

Based on the generated oversegmentation, the same algorithm is applied on the so called region graph [1]. For this purpose, color histograms are computed and the edge weight of neighboring segments is given by the $\chi^2$ distance:

$\chi^2 (h_{S_j}, h_{S_k}) = \sum_{q = 1}^Q \frac{(h_{S_j}(q) - h_{S_k}(q))^2}{h_{S_j}(q) + h_{S_k}(q)}$

where $h_{S_j}$ is the $Q$-bin histogram of segment $S_j$. By successively increasing the threshold $\tau$ a hierarchy of segmentations is generated.

In order to make use of temporal information, Grundmann et al. [1] use dense optical flow to temporally connect pixels in the initial 6-connected graph. Further, flow orientation histograms can be computed to refine the edge weights of the region graph.

Implementation

The implementation is based on OpenCV and Boost - detailed building instructions can be found here. Two command line tools are provided for running the video segmentation algorithm and computing optical flow. Usage instructions and parameters can be found here.

The following code snippet is intended to demonstrate the usage of the algorithm:

// The video is expected to be provided as sequence of images, see alley_1 as example.
// The same holds for the optical flow, which is provided using text files formatted
// according to cv::FileStorage, see io.h.
boost::filesystem::path in_dir("path/to/video");
boost::filesystem::path flow_dir("path/to/flow");

// Length of the sequnce to read (may be smaller than the actual length).
int length = 10;
Video video = IO::readVideo(in_dir, length);
FlowVideo flowVideo = IO::readFlowVideo(flow_dir, length);

assert(video.getFrameNumber() > 0);
assert(video.getFrameNumber() == flowVideo.getFrameNumber());

// Parameters:
int M = 300; // Minimum segment size, enforced after segmentation.
int L = 40; // Number of hierarchy levels.
float c = 0.02; // Threshold used for the initial oversegmentation.
float beta = 0.25; // Importance of flow information for edge weight computation.
float alpha = 1 - beta; // ... importance of color.

GraphSegmentationMagic* magic = new GraphSegmentationMagicThreshold(c);
GraphSegmentationDistance* distance = new GraphSegmentationEuclideanRGBFlowAngle(alpha, beta);

GraphSegmentation segmenter(distance, magic);

// Setup the video graph.
segmenter.buildGraph(video, flowVideo);
segmenter.buildEdges();    

// Oversegment the video.
segmenter.oversegmentGraph();

// Enforce minimum segment size.
segmenter.enforceMinimumSegmentSize(M);

// Get the corresponding segmentation video (i.e. the labels).
// The labels per frame are encoded in a three-channel image 
// as 24 bit numbers, see io_util.h
SegmentationVideo sv_video = segmenter.deriveLabels();
IO::writeSegmentationVideo(out_dir / boost::filesystem::path("0"), sv_video);

// Visualize by randonly coloring segments.
IO::writeColoredSegmentationVideo(vis_dir / boost::filesystem::path("0"), sv_video);

// Each new hierarchy level, the threshold is raised by the factor 1.3.
GraphSegmentationHierarchyMagic* hmagic = new GraphSegmentationHierarchyMagicThreshold(c, 1.3);
GraphSegmentationHierarchyDistance* hdistance = 
        new GraphSegmentationHierarchyRGBChiSquareFlowAngle(alpha, beta);

segmenter.setHierarchyMagic(hmagic);
segmenter.setHierarchyDistance(hdistance);

for (int l = 0; l < L; l++) {
    // Build the region graph.
    segmenter.buildRegionGraph();

    // Segment the region graph.
    segmenter.addHierarchyLevel();

    // Enforce minimum segment size.
    segmenter.enforceMinimumSegmentSize(l/2 * M);

    SegmentationVideo sv_video = segmenter.deriveLabels();
    IO::writeSegmentationVideo(out_dir / boost::filesystem::path(std::to_string(l + 1)), 
            sv_video);

    IO::writeColoredSegmentationVideo(vis_dir / boost::filesystem::path(std::to_string(l + 1)), 
            sv_video);
}

The class IO located in io.h is intended to read (and write) videos available as sequence of images in a single directory. Flow needs to be available in a format compatible with cv::FileStorage, containing a single two-channel floating point matrix of the correct size. For oversegmentation, the algorithm is controlled by two important components:

GraphSegmentationDistance as for example GraphSegmentationEuclideanRGBFlowAngle responsible for calculating the edge weights based on the available information (i.e. color and flow);
GraphSegmentationMagic deciding when to merge two neighboring nodes - in particular, GraphSegmentationMagicThreshold uses the threshold mechanism as described above.

This modularity allows to modify the edge weights and the merging mechanism. Note that for using additional information as depth or storing further statistics, VideoNode, VideoGraph (both in video_graph.h) and GraphSegmentation::buildGraph (in graph_segmentation.h) may need to be adapted. Several different components deriving from the above classes are already provided in graph_segmentation.h.

For the hierarchical segmentation, similar components are available:

GraphSegmentationHierarchyDistance as for example GraphSegmentationHierarchyRGBChiSquareFlowAngle for calculating the edge weights as the $\chi^2$ distance of the RGB and flow histograms computed within the region graph;
GraphSegmentationHierarchyMagic deciding when to merge two segments - in particular, GraphSegmentationHierarchyMagicThreshold applies the same threshold mechanism discussed above with increasing threshold (e.g. the threshold is multiplied by $1.3$ each level).

Both components can be adapted - here, GraphSegmentation::buildRegionGraph may be interesting for integrating additional information. Several different components deriving from the above classes are already provided in graph_segmentation.h.

Results

Figure 1 shows some results of the algorithm as computed on the alley_1 sequence of the Sintel dataset [3] (10 frames of the sequence are provided by the GitHub repository).

Figure 1 (click to enlarge): Video segmentation of the alley_1 sequence. From top to bottom: original sequence; oversegmentation; hierarchical segmentation.

Results in terms of 3D Undersegmentation Error as discussed in this article will be made available within the next few weeks.

[1] M. Grundmann, V. Kwatra, M. Han, I. A. Essa. Efficient hierarchical graph-based video segmentation. Conference on Computer Vision and Pattern Recognition, pages 2141–2148, San Francisco, June 2010.
[2] P. F. Felzenszwalb, D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167-181, September 2004.
[3] D. J. Butler, J. Wulff, G. B. Stanley, M. J. Black. A naturalistic open source movie for optical flow evaluation. In Computer Vision, Conference on, volume 7577 of Lecture Notes in Computer Science, pages 611–625. Springer, October 2012.

IAM

DAVIDSTUTZ

ARTICLE

Introduction

Hierarchical Graph-Based Video Segmentation

P. F. Felzenswalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, volume 59, number 2, 2004.

Implementation

Results

SEARCHTHEBLOG

ARCHIVES

TAGS