

K. Lai, L. Bo, D. Fox. Unsupervised feature learning for 3D scene labeling. ICRA, 2014.

Lai et al. present an approach to point cloud labeling based on hierarchical, sparse-coded features. The model itself is an MRF of the form

$E(y_1,\ldots, y_{|V|}) = \sum_{v \in V} \psi_v(y_v) + \sum_{(i,j) \in N} \psi_{i,j}(y_i,y_j)$

where $N$ is the set of neighboring voxel pairs, $V$ the set of voxels obtained by discretizing the point cloud, and $y_i$ the label of voxel $i$ which is to be inferred. The data term is computed using two classifiers, one based on the point cloud and one based on the RGB-D images:
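Once the unary and pairwise terms are tabulated, the energy above can be evaluated directly for any labeling. A minimal sketch, where the arrays, the Potts-style pairwise term, and all names are hypothetical stand-ins rather than the paper's actual terms:

```python
import numpy as np

def mrf_energy(labels, unary, neighbors, pairwise):
    """Evaluate E(y_1, ..., y_|V|) for a given labeling.

    labels    : (|V|,) int array, one label per voxel
    unary     : (|V|, K) array, unary[v, k] = psi_v(k)
    neighbors : list of (i, j) voxel index pairs (the set N)
    pairwise  : callable (i, j, y_i, y_j) -> psi_ij(y_i, y_j)
    """
    # sum the unary term of the chosen label at every voxel
    energy = unary[np.arange(len(labels)), labels].sum()
    # add the pairwise term over all neighboring voxel pairs
    for i, j in neighbors:
        energy += pairwise(i, j, labels[i], labels[j])
    return energy

# toy example: 3 voxels in a chain, 2 labels, Potts-style pairwise term
unary = np.array([[0.2, 1.0], [0.9, 0.1], [0.8, 0.3]])
neighbors = [(0, 1), (1, 2)]
potts = lambda i, j, yi, yj: 0.5 if yi != yj else 0.0
energy = mrf_energy(np.array([0, 1, 1]), unary, neighbors, potts)
```

Inference then amounts to searching for the labeling that minimizes this energy, which the toy evaluation obviously does not attempt.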

$\psi_v(y_v) = - \ln p(y_v|\Omega_v) = -\frac{1}{|\Omega_v|} \sum_{x \in \Omega_v} \left[\alpha \ln p_{\text{vox}}(y_v|x) + (1 - \alpha) \ln p_{\text{im}}(y_v|x)\right]$

Here, $p_{\text{vox}}$ is the probability distribution over labels as predicted by the classifier operating in the point cloud (i.e. voxel grid) and $p_{\text{im}}$ the corresponding distribution inferred over the RGB-D images. For details on the features and classifiers used, see the paper. The pairwise term takes into account whether voxels are likely to lie on the same surface:
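The blending of the two per-point distributions into the unary term can be sketched as follows; the array shapes, the `eps` floor, and the averaging over $\Omega_v$ are my assumptions, not the paper's implementation:

```python
import numpy as np

def unary_term(p_vox, p_im, alpha=0.5, eps=1e-12):
    """Data term: negative mixed log-likelihood, averaged over the
    points x in Omega_v that fall into voxel v.

    p_vox, p_im : (|Omega_v|, K) per-point label distributions from the
                  voxel-grid and image classifiers (hypothetical shapes)
    returns     : (K,) vector of psi_v(y_v) for each candidate label y_v
    """
    # eps guards against log(0) for hard classifier outputs
    log_mix = alpha * np.log(p_vox + eps) + (1.0 - alpha) * np.log(p_im + eps)
    return -log_mix.mean(axis=0)

# two points in the voxel, both classifiers favoring label 0,
# so label 0 should receive the lower energy
p_vox = np.array([[0.9, 0.1], [0.8, 0.2]])
p_im = np.array([[0.7, 0.3], [0.6, 0.4]])
psi = unary_term(p_vox, p_im)
```

With $\alpha$ steering how much the voxel classifier is trusted relative to the image classifier.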

$\psi_{i,j}(y_i, y_j) = \lambda \frac{\delta[y_i \neq y_j]}{d(n_i,n_j)} (I(n_i,n_j) + \epsilon)$

Here, $n_i$ denotes the normal corresponding to voxel $i$, $d$ measures the Euclidean distance, and $I(n_i,n_j)$ is an indicator function deciding whether voxels $i$ and $j$ are part of the same convex surface or not.
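A sketch of this pairwise term; note that passing voxel positions and a boolean convexity flag is a simplification of mine, since the paper defines both the distance and the indicator via the normals $n_i$, $n_j$:

```python
import numpy as np

def pairwise_term(y_i, y_j, x_i, x_j, same_convex_surface, lam=1.0, eps=0.01):
    """Penalize differing labels between neighboring voxels, scaled down
    with distance and (up to eps) suppressed across convexity boundaries.

    x_i, x_j            : 3D positions of voxels i and j (assumption;
                          the paper writes d(n_i, n_j))
    same_convex_surface : bool stand-in for the indicator I(n_i, n_j)
    """
    # delta[y_i != y_j]: agreeing labels incur no penalty
    if y_i == y_j:
        return 0.0
    d = np.linalg.norm(np.asarray(x_i) - np.asarray(x_j))
    indicator = 1.0 if same_convex_surface else 0.0
    return lam * (indicator + eps) / d
```

The $\epsilon$ keeps a small smoothing penalty even across convexity boundaries, so label changes are cheap there but not entirely free.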

The most interesting part of the paper is the use of matching pursuit to compute hierarchical features. In particular, they consider two layers. The first layer takes $5 \times 5 \times 5$ voxel patches and applies orthogonal matching pursuit to compute sparse codes of dimension $M$. These sparse codes are then pooled in $3 \times 3 \times 3$ cells, $2 \times 2 \times 2$ cells and a single cell, and the resulting representations are concatenated. The second layer samples patches of size $20 \times 20 \times 20$ over the first-layer features (i.e. $4 \times 4 \times 4$ patches of the first layer). The features are again pooled and concatenated. The resulting features are used for the voxel classifier as part of the data term.
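To make the first-layer encoding concrete, here is a small numpy sketch of orthogonal matching pursuit over flattened $5 \times 5 \times 5$ patches followed by max pooling. The random dictionary, the pooling over a flat list of patches, and all sizes are illustrative assumptions, not the learned dictionary or cell layout from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def omp(D, y, n_nonzero):
    """Greedy orthogonal matching pursuit: pick the dictionary atom most
    correlated with the residual, refit least squares on the selected
    atoms, and repeat."""
    residual = y.copy()
    support = []
    code = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    code[support] = coef
    return code

# hypothetical dictionary: M atoms for flattened 5x5x5 voxel patches,
# columns normalized to unit length
patch_dim, M = 5 * 5 * 5, 16
D = rng.standard_normal((patch_dim, M))
D /= np.linalg.norm(D, axis=0)

# toy stand-in for a voxel grid: encode a few random patches and
# max-pool the absolute codes, mimicking the cell pooling above
patches = [rng.standard_normal(patch_dim) for _ in range(8)]
codes = np.stack([omp(D, p, n_nonzero=4) for p in patches])  # (8, M)
pooled = np.abs(codes).max(axis=0)                           # (M,)
```

In the paper the dictionary is learned rather than random, and pooling runs over spatial cells at several granularities whose outputs are concatenated; the sketch only shows the encode-then-pool pattern.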

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me.