"Unsupervised feature learning for 3D scene labeling", Lai et al. • David Stutz

JANUARY 2018

READING

K. Lai, L. Bo, D. Fox. Unsupervised feature learning for 3D scene labeling. ICRA, 2014.

COMPUTER VISION

Lai et al. present an approach to point cloud labeling based on hierarchical, sparse-coded features. The model itself is a Markov random field (MRF) with an energy of the form

$E(y_1,\ldots, y_{|V|}) = \sum_{v \in V} \psi_v(y_v) + \sum_{(i,j) \in N} \psi_{i,j}(y_i,y_j)$

where $V$ is the set of voxels obtained by discretizing the point cloud, $N$ is the set of pairs of neighboring voxels, and $y_i$ is the label of voxel $i$, which is to be inferred. The data term is computed using two classifiers, one based on the point cloud and one based on the RGBD images:

$\psi_v(y_v) = - \ln p(y_v|\Omega_v) = -\frac{1}{|\Omega_v|} \sum_{x \in \Omega_v} \left[ \alpha \ln p_{\text{vox}}(y_v|x) + (1 - \alpha) \ln p_{\text{im}}(y_v|x) \right]$

Here, $\Omega_v$ is the set of points falling into voxel $v$, $p_{\text{vox}}$ is the probability distribution over labels as predicted by the classifier operating on the point cloud (i.e. the voxel grid), $p_{\text{im}}$ is the corresponding distribution inferred from the RGBD images, and $\alpha$ weights the two classifiers against each other. For details on the features and classifiers used, see the paper. The pairwise term takes into account whether voxels are likely to lie on the same surface:

$\psi_{i,j}(y_i,y_j) = \lambda \frac{\delta[y_i \neq y_j]}{d(n_i,n_j)} (I(n_i,n_j) + \epsilon)$

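Here, $n_i$ denotes the normal corresponding to voxel $i$ and $d$ is the Euclidean distance between two normals. Furthermore, $I(n_i,n_j)$ is an indicator function deciding whether $i$ and $j$ are part of the same convex surface or not. According to this formula, a label change between neighbors is expensive when their normals are similar and they lie on the same convex surface, and cheap otherwise, where only the small constant $\epsilon$ remains in the numerator.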
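To make the energy concrete, the following Python sketch evaluates it for a given labeling. It treats the data term as a negative log-likelihood, i.e. the negated average of the weighted log-probabilities over the points in a voxel. All function names, the default values of `alpha`, `lam` and `eps`, the clamping of the normal distance, and the representation of the convexity indicator as a set of voxel pairs are assumptions for illustration; they are not taken from the paper.

```python
import math


def data_term(p_vox, p_im, alpha=0.5):
    """psi_v(y_v) for one candidate label y_v.

    p_vox, p_im: per-point probabilities of y_v from the voxel and the
    image classifier, one entry per point x in the voxel.
    alpha=0.5 is a placeholder value, not the paper's setting.
    """
    total = sum(alpha * math.log(pv) + (1.0 - alpha) * math.log(pi)
                for pv, pi in zip(p_vox, p_im))
    return -total / len(p_vox)


def pairwise_term(y_i, y_j, n_i, n_j, same_convex, lam=1.0, eps=0.1):
    """psi_ij: zero for equal labels, else lam * (I + eps) / d(n_i, n_j)."""
    if y_i == y_j:
        return 0.0
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(n_i, n_j)))
    # Assumption: clamp d so that identical normals do not divide by zero.
    d = max(d, 1e-6)
    return lam * (float(same_convex) + eps) / d


def mrf_energy(labels, unary, edges, normals, convex, lam=1.0, eps=0.1):
    """Total energy of one labeling.

    labels:  list with one label per voxel
    unary:   list of dicts, unary[v][label] = psi_v(label)
    edges:   list of neighboring voxel pairs (i, j)
    normals: list of normal vectors, one per voxel
    convex:  set of pairs (i, j) for which the indicator I is 1
    """
    data = sum(unary[v][y] for v, y in enumerate(labels))
    smooth = sum(
        pairwise_term(labels[i], labels[j], normals[i], normals[j],
                      (i, j) in convex, lam, eps)
        for i, j in edges)
    return data + smooth
```

Inference then amounts to searching for the labeling with the lowest `mrf_energy`; the sketch only scores a labeling and does not perform that minimization.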

The most interesting part of the paper is the use of matching pursuit to compute hierarchical features. In particular, they consider two layers. The first layer takes $5 \times 5 \times 5$ voxel patches and applies orthogonal matching pursuit to compute sparse codes of dimension $M$. These sparse codes are then pooled in $3 \times 3 \times 3$ cells, $2 \times 2 \times 2$ cells and 1 cell, i.e. $27 + 8 + 1 = 36$ cells in total. The resulting representations are concatenated. The second layer samples patches of size $20 \times 20 \times 20$ voxels over the first-layer features (i.e. $4 \times 4 \times 4$ patches of the first layer). The features are again pooled and concatenated. The resulting features are used for the voxel classifier as part of the data term.
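The two building blocks of one such layer, sparse coding and pyramid pooling, can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: the dictionary `atoms` is given (how it is learned is not covered here), patches are flattened into vectors, and pooling takes the maximum absolute value per code dimension, which this summary does not specify. The names `omp`, `solve` and `pyramid_max_pool` are hypothetical helpers, not the authors' code.

```python
from itertools import product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve the small square system A c = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    aug = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (aug[r][n] - known) / aug[r][r]
    return coeffs


def omp(x, atoms, sparsity):
    """Orthogonal matching pursuit: greedily select atoms, then refit all
    selected coefficients by least squares after every selection."""
    residual, selected, coeffs = list(x), [], []
    for _ in range(sparsity):
        scores = [0.0 if m in selected else abs(dot(a, residual))
                  for m, a in enumerate(atoms)]
        best = max(range(len(atoms)), key=lambda m: scores[m])
        if scores[best] < 1e-12:  # residual is already explained
            break
        selected.append(best)
        # Least squares on the selected atoms via the normal equations.
        gram = [[dot(atoms[i], atoms[j]) for j in selected] for i in selected]
        rhs = [dot(atoms[i], x) for i in selected]
        coeffs = solve(gram, rhs)
        residual = [xk - sum(c * atoms[m][k] for c, m in zip(coeffs, selected))
                    for k, xk in enumerate(x)]
    code = [0.0] * len(atoms)  # sparse code of dimension M = len(atoms)
    for c, m in zip(coeffs, selected):
        code[m] = c
    return code


def pyramid_max_pool(codes, side, levels=(3, 2, 1)):
    """Pool codes on a side^3 grid into 3^3 + 2^3 + 1 = 36 cells and concatenate.

    codes: dict mapping (x, y, z) to a sparse code (list of M floats).
    Assumption: max over absolute values within each cell.
    """
    dim = len(next(iter(codes.values())))
    feature = []
    for k in levels:
        # Cell c along one axis covers indices [c * side // k, (c + 1) * side // k).
        spans = [range(c * side // k, (c + 1) * side // k) for c in range(k)]
        for cell in product(range(k), repeat=3):
            pooled = [0.0] * dim
            for pos in product(*(spans[c] for c in cell)):
                for d, value in enumerate(codes[pos]):
                    pooled[d] = max(pooled[d], abs(value))
            feature.extend(pooled)
    return feature
```

With the three pyramid levels, the pooled feature has $36M$ entries for codes of dimension $M$. For dictionaries of realistic size, a numerical library would replace the hand-written linear solver.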

