M. Menze, A. Geiger. Object scene flow for autonomous vehicles. Conference on Computer Vision and Pattern Recognition, 2015.

Menze and Geiger propose a superpixel-based approach to object scene flow (see [1] or this reading for a quick introduction to scene flow) as well as a KITTI-based [2] dataset for scene flow. For scene flow estimation, a CRF model of the form

$E(s, o) = \sum_{i} \phi_i(s_i, o) + \sum_{i \sim j} \psi_{ij}(s_i, s_j) \qquad (1)$

is used. Here, $s = \{s_i\}$ denotes a set of superpixels, $o = \{o_i\}$ denotes a set of objects and $\sim$ the relation of adjacent superpixels. Each superpixel is modeled as plane; parameterized by the corresponding normal $n_i$. Further, the scene flow $o_i$ is inherited by the corresponding object a superpixel belongs to; note that objects and their scene flow are denoted by $o_i$. The association of superpixels to objects is estimated jointly with the corresponding scene flow.

Given the left and right frames at times $t$ and $(t - 1)$, the data term in Equation (1) describes the matching cost of each superpixel:

$\phi_i(s_i, o) = \sum_{o_j \in o} \delta(k_i = j) \left(D_{i,\text{stereo}}(n_i, o_j) + D_{i,\text{flow}}(n_i, o_j) + D_{i, \text{cross}}(n_i, o_j)\right)$

where $k_i$ denotes the object association of superpixel $i$. The individual terms, i.e. the stereo term, the flow term and the cross term, sum the matching cost of each pixel inside the superpixel:

$D_{i,\text{x}}(n_i, o_j) = \sum_{p \in s_i} d(p, K(R_{\text{x}}(o_j) - t_{\text{x}}(o_j) n_i^T)K^{-1}p)$

where $d(p, p')$ denotes a dissimilarity/distance measure. For $\text{x} \in \{\text{stereo}, \text{flow}, \text{cross}\}$, $[R_{\text{x}}(o_j) | t_{\text{x}}(o_j)]$ denotes the rotation and translation that map a pixel in reference coordinates (Menze and Geiger use the left image at time $t$ as reference) to a pixel in one of the remaining coordinate systems according to the extrinsic camera parameters and the scene flow $o_j$. Menze and Geiger use both dense and sparse matching to define the dissimilarity $d$: for dense matching, the Hamming distance of $5 \times 5$ Census descriptors [4] is used, while sparse correspondences are computed beforehand using [5] and [6]. The data term is also illustrated in Figure 1.
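The warp $K(R - t n^T)K^{-1}p$ inside the data term is a plane-induced homography. The following NumPy sketch illustrates it for the stereo case. It assumes that points are transformed as $X' = RX + t$ and that the plane is scaled such that $n^T X = -1$, which is the convention under which the minus sign holds; the intrinsics, baseline and helper names are hypothetical.

```python
import numpy as np


def plane_homography(K, R, t, n):
    """Homography K (R - t n^T) K^{-1} induced by the plane n^T X = -1."""
    return K @ (R - np.outer(t, n)) @ np.linalg.inv(K)


def warp_pixel(H, p):
    """Apply H to the pixel p = (u, v) and dehomogenize the result."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]


def project(K, X):
    """Pinhole projection of a 3D point X given in camera coordinates."""
    x = K @ X
    return x[:2] / x[2]


# Hypothetical intrinsics and a fronto-parallel plane at depth Z = 10.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
n = np.array([0.0, 0.0, -0.1])  # n^T X = -1  <=>  Z = 10
R = np.eye(3)
t = np.array([-0.5, 0.0, 0.0])  # right camera with an assumed baseline of 0.5

X = np.array([1.0, -2.0, 10.0])  # a 3D point on the plane
p = project(K, X)                # its pixel in the reference view

p_warped = warp_pixel(plane_homography(K, R, t, n), p)  # via the homography
p_direct = project(K, R @ X + t)                        # via explicit 3D transformation
```

Both routes give the same pixel in the second view. For the flow and cross terms, $R$ and $t$ would additionally include the rigid motion of the associated object.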


Figure 1: Illustration of the data term for a specific superpixel in the reference frame (bottom-left).
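The dense dissimilarity $d$ mentioned above can be sketched as follows. This uses a common formulation of the Census transform (comparing each neighbor in the window against the center pixel); the exact variant in [4] or in the paper may differ, and the function names are assumptions.

```python
import numpy as np


def census_descriptor(img, u, v, radius=2):
    """Census bits for pixel (u, v): True where a neighbor in the
    (2 * radius + 1)^2 window is darker than the center pixel."""
    patch = img[v - radius:v + radius + 1, u - radius:u + radius + 1]
    bits = (patch < img[v, u]).ravel()
    return np.delete(bits, bits.size // 2)  # drop the center pixel itself


def hamming(a, b):
    """Number of positions in which two descriptors differ."""
    return int(np.count_nonzero(a != b))
```

With `radius=2`, the window is $5 \times 5$ and the descriptor has 24 bits. Since only intensity orderings are stored, adding a constant brightness offset to the image leaves the descriptor unchanged. The sketch does not handle pixels closer than `radius` to the image border.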

The second term in Equation (1) encodes smoothness regarding depth, orientation and motion, weighted by $\alpha$, $\beta$ and $\gamma$:

$\psi_{ij}(s_i, s_j) = \alpha \psi_{ij,\text{depth}} (n_i, n_j) + \beta \psi_{ij,\text{orient}} (n_i, n_j) + \gamma \psi_{ij,\text{motion}} (s_i, s_j)$


$\psi_{ij,\text{depth}} (n_i, n_j) = \sum_{p \in B_{ij}} |\text{disp}(n_i, p) - \text{disp}(n_j, p)|$

$\psi_{ij,\text{orient}} (n_i, n_j) = \left|1 - \frac{|n_i^T n_j|}{\|n_i\|_2\|n_j\|_2}\right|$

$\psi_{ij,\text{motion}} (s_i, s_j) = \delta(k_i \neq k_j)\exp\left( - \frac{1}{|B_{ij}|}\sum_{p \in B_{ij}} (\text{disp}(n_i, p) - \text{disp}(n_j, p))^2\right) \cdot \frac{|n_i^Tn_j|}{\|n_i\|_2 \|n_j\|_2}$

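where $\text{disp}(n_i, p)$ describes the disparity of plane $n_i$ at pixel $p$ and $B_{ij}$ denotes the set of pixels on the shared boundary of superpixels $i$ and $j$. Note that the motion term is only active if two adjacent superpixels are assigned to different objects.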
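The three smoothness terms translate almost directly into code. The sketch below assumes that the disparities of both planes have already been evaluated at the pixels of $B_{ij}$ (taken here to be the shared boundary pixels) and are passed as arrays `disp_i` and `disp_j`; all function names are hypothetical.

```python
import numpy as np


def cos_angle(n_i, n_j):
    """Absolute cosine of the angle between two plane normals."""
    return abs(n_i @ n_j) / (np.linalg.norm(n_i) * np.linalg.norm(n_j))


def psi_depth(disp_i, disp_j):
    """Sum of absolute disparity differences over the boundary pixels."""
    return float(np.sum(np.abs(disp_i - disp_j)))


def psi_orient(n_i, n_j):
    """Zero for parallel normals, one for orthogonal normals."""
    return float(abs(1.0 - cos_angle(n_i, n_j)))


def psi_motion(k_i, k_j, n_i, n_j, disp_i, disp_j):
    """Penalty for assigning geometrically similar neighbors to different objects."""
    if k_i == k_j:
        return 0.0
    mean_sq = np.mean((disp_i - disp_j) ** 2)
    return float(np.exp(-mean_sq) * cos_angle(n_i, n_j))


def psi(k_i, k_j, n_i, n_j, disp_i, disp_j, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return (alpha * psi_depth(disp_i, disp_j)
            + beta * psi_orient(n_i, n_j)
            + gamma * psi_motion(k_i, k_j, n_i, n_j, disp_i, disp_j))
```

The motion term is largest when two neighboring superpixels with different object labels agree in both disparity and orientation, so a change of object is cheapest where the geometry changes as well.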

Overall, the model relies on superpixels computed using StereoSLIC [3], optical flow as in [5] and disparity maps computed using [6]. Due to the discrete association variables $k_i$, Equation (1) describes a discrete-continuous CRF; for inference, Menze and Geiger use particle belief propagation (see [7] or [8]). Details are given in the paper.

For evaluation, Menze and Geiger present a KITTI-based dataset for object scene flow by fitting a selected set of CAD models to the point clouds of identified objects. An illustration of the results is shown in Figure 2.


Figure 2: Estimated moving objects, scene flow results and ground truth.