M. Menze, A. Geiger. Object scene flow for autonomous vehicles. Conference on Computer Vision and Pattern Recognition, 2015.

Menze and Geiger propose a superpixel-based approach to object scene flow (see [1] or this reading for a quick introduction to scene flow) as well as a KITTI-based [2] dataset for scene flow. For scene flow estimation, a CRF-model of the form

$E(s, o) = \sum_{i} \phi_i(s_i, o) + \sum_{i \sim j} \psi_{ij}(s_i, s_j)$

is used. Here, $s = \{s_i\}$ denotes the set of superpixels, $o = \{o_i\}$ the set of objects, and $\sim$ the adjacency relation between superpixels. Each superpixel is modeled as a plane parameterized by its normal $n_i$. Further, each superpixel inherits its scene flow from the object it belongs to; note that $o_i$ denotes both an object and its scene flow. The association of superpixels to objects is estimated jointly with the scene flow.

Given left and right frames for time $t$ and $(t - 1)$, the data term in Equation (1) describes the matching cost of each superpixel:

$\phi_i(s_i, o) = \sum_{o_j \in o} \delta(k_i = j) \left(D_{i,\text{stereo}}(n_i, o_j) + D_{i,\text{flow}}(n_i, o_j) + D_{i, \text{cross}}(n_i, o_j)\right),$

where $k_i$ denotes the object association of superpixel $i$. The individual terms, i.e. the stereo term, the flow term and the cross term, sum the matching cost of each pixel inside the superpixel:

$D_{i,\text{x}}(n_i, o_j) = \sum_{p \in s_i} d(p, K(R_{\text{x}}(o_j) - t_{\text{x}}(o_j) n_i^T)K^{-1}p)$

where $d(p, p')$ denotes a dissimilarity/distance measure. For $\text{x} \in \{\text{stereo}, \text{flow}, \text{cross}\}$, $[R_{\text{x}}(o_j) | t_{\text{x}}(o_j)]$ denotes the rotation and translation that map a pixel in reference coordinates (Menze and Geiger use the left image at time $t$ as reference) to a pixel in one of the remaining coordinate systems, according to the extrinsic camera parameters and the scene flow $o_j$. Menze and Geiger use both dense and sparse matching to define the dissimilarity $d$: for dense matching, the Hamming distance of $5 \times 5$ Census descriptors [4] is used, while sparse correspondences are computed beforehand using [5] and [6]. The data term is also illustrated in Figure 1.


Figure 1: Illustration of the data term for a specific superpixel in the reference frame (bottom-left).
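The two ingredients of the data term, the plane-induced homography and the Census-based dissimilarity, can be sketched in a few lines of NumPy. This is only a minimal illustration, not the authors' implementation: the function names, the camera matrix, and the example values are hypothetical, the plane is assumed to be scaled such that $n^T X = 1$ for points $X$ on it, and the bit ordering of the Census descriptor is an arbitrary choice.

```python
import numpy as np

def plane_homography(K, R, t, n):
    """H = K (R - t n^T) K^{-1}, assuming the plane satisfies n^T X = 1."""
    return K @ (R - np.outer(t, n)) @ np.linalg.inv(K)

def warp_pixel(H, p):
    """Map pixel p = (u, v) through H and dehomogenize."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def census_5x5(img, row, col):
    """24-bit Census descriptor: one bit per neighbor, set if neighbor < center."""
    center = img[row, col]
    bits = 0
    for dr in range(-2, 3):
        for dc in range(-2, 3):
            if dr == 0 and dc == 0:
                continue  # skip the center pixel itself
            bits = (bits << 1) | int(img[row + dr, col + dc] < center)
    return bits

def hamming(a, b):
    """Hamming distance between two descriptors stored as Python ints."""
    return bin(a ^ b).count("1")

# Hypothetical example values (not taken from the paper).
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                     # no rotation
t = np.array([0.5, 0.0, 0.0])     # horizontal offset, as in a stereo pair
n = np.array([0.0, 0.0, 0.1])     # fronto-parallel plane at depth Z = 10

H = plane_homography(K, R, t, n)
p_warped = warp_pixel(H, (650.0, 200.0))
```

In this setting, the warp shifts the pixel horizontally by $f \cdot 0.5 / Z = 35$ pixels, which is exactly the disparity one would expect for a stereo pair with baseline $0.5$. Summing `hamming` over the Census descriptors of all pixel pairs $(p, p')$ of a superpixel would then give one of the terms $D_{i,\text{x}}$.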

The second term in Equation (1) encodes smoothness regarding depth, orientation and motion:

$\psi_{ij}(s_i, s_j) = \alpha \psi_{ij,\text{depth}} (n_i, n_j) + \beta \psi_{ij,\text{orient}} (n_i, n_j) + \gamma \psi_{ij,\text{motion}} (s_i, s_j)$


$\psi_{ij,\text{depth}} (n_i, n_j) = \sum_{p \in B_{ij}} |\text{disp}(n_i, p) - \text{disp}(n_j, p)|$

$\psi_{ij,\text{orient}} (n_i, n_j) = \left|1 - \frac{|n_i^T n_j|}{\|n_i\|_2\|n_j\|_2}\right|$

$\psi_{ij,\text{motion}} (s_i, s_j) = \delta(k_i \neq k_j)\exp\left( - \frac{1}{|B_{ij}|}\sum_{p \in B_{ij}} (\text{disp}(n_i, p) - \text{disp}(n_j, p))^2\right) \cdot \frac{|n_i^Tn_j|}{\|n_i\|_2 \|n_j\|_2}$

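where $\text{disp}(n_i, p)$ describes the disparity of plane $n_i$ at pixel $p$, $B_{ij}$ denotes the set of boundary pixels shared by superpixels $i$ and $j$, and $\alpha$, $\beta$ and $\gamma$ weight the three terms.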
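The three smoothness terms are straightforward to evaluate once the disparity of a plane can be computed. The following is a minimal sketch under stated assumptions, not the authors' code: the plane is assumed to be scaled such that $n^T X = 1$, so that the inverse depth at pixel $p$ is $n^T K^{-1} p$ and the disparity is $f b \cdot n^T K^{-1} p$ for focal length $f$ and baseline $b$; $B_{ij}$ is taken to be a list of shared boundary pixels; all function names and example values are hypothetical.

```python
import numpy as np

def disp(n, p, K, fb):
    """Disparity of the plane n^T X = 1 at pixel p = (u, v); fb = focal length * baseline."""
    ray = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    return fb * float(n @ ray)

def cos_abs(n_i, n_j):
    """Absolute cosine of the angle between two plane normals."""
    return abs(n_i @ n_j) / (np.linalg.norm(n_i) * np.linalg.norm(n_j))

def psi_depth(n_i, n_j, boundary, K, fb):
    """Sum of absolute disparity differences over the shared boundary pixels."""
    return sum(abs(disp(n_i, p, K, fb) - disp(n_j, p, K, fb)) for p in boundary)

def psi_orient(n_i, n_j):
    """Penalizes normals that are not parallel."""
    return abs(1.0 - cos_abs(n_i, n_j))

def psi_motion(n_i, n_j, k_i, k_j, boundary, K, fb):
    """Penalizes different object labels where the geometry is continuous."""
    if k_i == k_j:
        return 0.0
    sq = [(disp(n_i, p, K, fb) - disp(n_j, p, K, fb)) ** 2 for p in boundary]
    return float(np.exp(-np.mean(sq)) * cos_abs(n_i, n_j))

# Hypothetical example values (not taken from the paper).
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
fb = 700.0 * 0.5                    # focal length times baseline
n_i = np.array([0.0, 0.0, 0.1])     # fronto-parallel plane at depth 10
n_j = np.array([0.0, 0.0, 0.05])    # fronto-parallel plane at depth 20
boundary = [(640.0, 200.0), (640.0, 201.0)]

depth_cost = psi_depth(n_i, n_j, boundary, K, fb)
orient_cost = psi_orient(n_i, n_j)
motion_cost = psi_motion(n_i, n_j, 0, 1, boundary, K, fb)
```

In this example, the two planes are parallel, so the orientation term vanishes, while the depth term accumulates the disparity gap of $17.5$ at each boundary pixel. The motion term is close to zero because the large disparity gap suggests a depth discontinuity, where a change of object label is plausible.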

Overall, the model relies on superpixels computed using StereoSLIC [3], optical flow as in [5] and disparity maps computed using [6]. Because the discrete association variables $k_i$ are optimized alongside the continuous plane and motion parameters, Equation (1) describes a discrete-continuous CRF, and Menze and Geiger use particle belief propagation for inference (see [7] or [8]). Details are given in the paper.

For evaluation, Menze and Geiger present a KITTI-based dataset for object scene flow by fitting a selected set of CAD models to the point clouds of identified objects. An illustration of the results is shown in Figure 2.


Figure 2: Estimated moving objects, scene flow results and ground truth.
