

Francis Engelmann, Jörg Stückler, Bastian Leibe. Joint Object Pose Estimation and Shape Reconstruction in Urban Street Scenes Using 3D Shape Priors. GCPR, 2016.

Engelmann et al. propose an approach for 3D pose estimation and shape reconstruction based on a 3D object detector. In particular, given 3D bounding boxes from 3DOP [3], pose and shape are estimated by fitting CAD models. To this end, a shape manifold is learned from the CAD models and, in a second step, shape and pose are determined jointly.

A 3D shape prior is modeled using PCA. To this end, the Truncated Signed Distance Function (TSDF) voxel grids derived from CAD models are embedded in a linear subspace through

$z(\tilde{\psi}) = V^T (\tilde{\psi} - \mu_{\tilde{\psi}})$

where $z(\tilde{\psi})$ denotes the encoding obtained from the stacked distance values $\tilde{\psi}$ and $\mu_{\tilde{\psi}}$ is the corresponding mean. The matrix $V$ is obtained through singular value decomposition (SVD). The learned manifold is illustrated in Figure 1.

Figure 1: Example shapes where the middle shape corresponds to the mean.
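The PCA embedding can be sketched in a few lines of numpy. Note that the sizes, the random placeholder data, and the function names are my own assumptions for illustration; the paper works on TSDF grids voxelized from CAD models:

```python
import numpy as np

# Sketch of the PCA shape space; sizes and random TSDF data are placeholders.
M, D, K = 50, 16**3, 5         # CAD models, voxels per TSDF grid, latent dimensions
tsdfs = np.random.randn(M, D)  # stacked (flattened) TSDF values, one row per model

mean = tsdfs.mean(axis=0)
# SVD of the centered data matrix; rows of Vt are the principal directions.
_, S, Vt = np.linalg.svd(tsdfs - mean, full_matrices=False)
V = Vt[:K].T            # D x K basis of the linear shape subspace
sigma2 = S[:K]**2 / M   # eigenvalues (per-component variances), used by the shape prior

def encode(psi):
    """z(psi) = V^T (psi - mean): project a TSDF onto the shape manifold."""
    return V.T @ (psi - mean)

def decode(z):
    """Map a low-dimensional encoding back to a (flattened) TSDF."""
    return mean + V @ z
```

The mean shape encodes to the zero vector, so the middle shape in Figure 1 corresponds to $z = 0$.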

3D object detections are obtained from 3DOP and subsequently post-processed in order to segment the points belonging to the detected object. The 3D bounding box is then refined based on the segmentation, and an initial pose estimate is derived from the refined box. Shape and pose are then optimized jointly by minimizing

$E(\mathcal{X}, \xi, z) = -\frac{1}{N} \left(\sum_{i = 1}^N \log[p(x_i|\xi,z)]\right) - \log p(z) - \log p(\xi)$

where the observation likelihood $p(x_i | \xi, z)$ depends on the TSDF such that

$\log p(x_i |\xi, z) = \text{const} - \frac{1}{2\sigma_d^2} \rho(\psi(R(\theta)x_i + t, z))$

with $\rho = \|\cdot\|_\epsilon$ denoting the Huber norm. Here, $R(\theta)$ is the rotation corresponding to the pose $\xi$ and $t$ the corresponding translation. Deviations from the mean shape are penalized through the shape prior:

$\log p(z) = \text{const} - \frac{1}{2} \sum_{j = 1}^K \left(\frac{z_j}{\sigma_j}\right)^2$

where $\sigma_j^2$ is the eigenvalue of the $j$-th principal component as computed using PCA. Finally, the pose prior enforces the object to stand on the ground:

$\log p(\xi) = \text{const} - \frac{1}{2\sigma_y^2} (t_y - g(t))^2$

The overall objective is minimized using gradient descent.
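A minimal sketch of the joint minimization, assuming a flat ground plane (so $g(t)$ is a constant) and using central finite differences in place of analytic gradients; `data_fn` and all parameter values are placeholders:

```python
import numpy as np

def energy(z, xi, data_fn, sigma2, sigma_y=0.1, ground_y=0.0):
    """E(X, xi, z) up to constants: data term + shape prior + pose prior.

    z:       K-dim shape encoding
    xi:      pose parameters [theta, tx, ty, tz]
    data_fn: callable returning the data term for (z, xi); a placeholder here
    sigma2:  PCA eigenvalues, one per latent dimension
    """
    e_shape = 0.5 * np.sum(z**2 / sigma2)                # -log p(z)
    e_pose = (xi[2] - ground_y)**2 / (2.0 * sigma_y**2)  # -log p(xi), flat ground assumed
    return data_fn(z, xi) + e_shape + e_pose

def minimize(z, xi, data_fn, sigma2, lr=1e-2, steps=100, h=1e-5):
    """Plain gradient descent with central finite-difference gradients."""
    x = np.concatenate([z, xi]).astype(float)
    K = len(z)
    f = lambda x: energy(x[:K], x[K:], data_fn, sigma2)
    for _ in range(steps):
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
        x -= lr * g
    return x[:K], x[K:]
```

With a zero data term, the priors alone pull the encoding toward the mean shape and the object onto the ground plane, which matches the intuition behind the regularizers.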

The approach is compared to 3DOP as a baseline for pose estimation. Unfortunately, no related work is used as a baseline for the shape reconstruction. Qualitative results are shown in Figure 2.

Figure 2: Qualitative results showing 3DOP detections, the inferred shapes, as well as the libELAS depth map and SPS-Stereo depth map (both left) with the improved depth maps after shape reconstruction (both right).

  • [3] Chen, X., Kundu, K., Zhu, Y., Berneshawi, A., Ma, H., Fidler, S., Urtasun, R.: 3D object proposals for accurate object class detection. In: Proc. of Neural Information Processing Systems (NIPS) (2015).

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me: