"Learning 3D Object Categories by Looking Around Them", Novotny et al. • David Stutz

MARCH 2018

READING

David Novotný, Diane Larlus, Andrea Vedaldi. Learning 3D Object Categories by Looking Around Them. ICCV, 2017.

Novotny et al. introduce VpDR-Net, a network for joint viewpoint and depth estimation as well as shape completion by predicting completed point clouds. The high-level approach is illustrated in Figure 1. For details on the individual networks, I recommend reading the supplementary material of the paper. The proposed network consists of three components: a viewpoint estimation network, a depth estimation network, and a point cloud completion (i.e. shape completion) network.

Figure 1: High-level overview of the proposed VpDR-Net.

An important aspect to discuss is the input to the network. As illustrated in Figure 1, given a sequence of images (i.e. a video) of an object, structure-from-motion is used to estimate a 3D point cloud reconstruction. If RGBD frames are available, KinectFusion [24] is used instead of structure-from-motion. Thus, the network gets as input the images including estimated depth, the estimated point cloud (for supervision) and the estimated egomotion.

The viewpoint estimation network shares its weights with the depth estimation network. The architecture is a siamese network in which each of the two streams follows the ResNet50 [15] architecture, followed by a hypercolumn descriptor module [14] which directly predicts depth and viewpoint. The input to the siamese network is just two images taken from different frames. Predicting viewpoints poses several problems. First, no absolute viewpoints are available for training; instead, viewpoints are only valid within individual sequences. However, Novotny et al. intend to predict the absolute, sequence-independent viewpoint. Therefore, they supervise the network with relative viewpoint changes between frames within sequences; details can be found in the paper. For both the viewpoint and the depth prediction, they introduce an uncertainty measure. In particular, they let the network predict both the viewpoint and the associated uncertainty, encoded as a variance. Translation is then modeled as a Gaussian, such that the negative log-likelihood can be optimized; rotation is modeled as Laplacian. Note that the negative log-likelihoods are then governed by the predicted variance and the loss (i.e. error) between prediction and ground truth. For depth, a similar approach is taken: depth is also modeled as Laplacian.
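To make the role of the predicted uncertainty concrete, here is a minimal sketch of the two negative log-likelihoods for scalar quantities. This is not the authors' implementation: the function names, the scalar setting and the parametrization of the Laplacian by a scale `b` are my assumptions, and the exact parametrization used in the paper may differ.

```python
import math


def gaussian_nll(pred, target, variance):
    """Negative log-likelihood of target under a 1D Gaussian N(pred, variance).

    Hypothetical helper: the network is assumed to output both pred and variance.
    """
    return 0.5 * math.log(2.0 * math.pi * variance) + (target - pred) ** 2 / (2.0 * variance)


def laplace_nll(pred, target, scale):
    """Negative log-likelihood of target under a 1D Laplace(pred, scale).

    Hypothetical helper: scale is the Laplace parameter b (variance = 2 * b^2).
    """
    return math.log(2.0 * scale) + abs(target - pred) / scale


# With a large error, predicting a larger variance lowers the loss ...
print(gaussian_nll(0.0, 3.0, 1.0), gaussian_nll(0.0, 3.0, 9.0))
# ... while with a small error, a larger variance is penalized by the log term.
print(gaussian_nll(0.0, 0.1, 1.0), gaussian_nll(0.0, 0.1, 9.0))
```

In both cases, the error term is divided by the predicted uncertainty while a log term penalizes large uncertainties, so the network can only reduce the loss on difficult examples by admitting that it is uncertain.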

Finally, the point cloud completion network is based on PointNet [28]. In particular, it takes as input a partial point cloud obtained from the predicted viewpoints and depth maps, and predicts $M$ 3D points. These points should fit the ground truth point cloud as closely as possible, i.e. the sum of squared errors between nearest neighbors is minimized. On top of the 3D points, the network also predicts a probability distribution: for each predicted point, a scalar indicates the fraction of ground truth points lying closest to that point. The point cloud completion network is depicted in Figure 2.
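The nearest-neighbor loss and the per-point fractions described above can be sketched in a few lines of plain Python. Note that this is only an illustration under my own assumptions: I match each ground truth point to its nearest predicted point (the direction of the matching is not spelled out above), and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_index(point, candidates):
    """Index of the candidate closest to point (first one wins on ties)."""
    return min(range(len(candidates)), key=lambda i: squared_distance(point, candidates[i]))


def completion_targets(predicted, ground_truth):
    """Return the sum of squared nearest-neighbor errors and, for each predicted
    point, the fraction of ground truth points that lie closest to it."""
    loss = 0.0
    counts = [0] * len(predicted)
    for g in ground_truth:
        i = nearest_index(g, predicted)  # assumption: ground truth -> prediction matching
        loss += squared_distance(g, predicted[i])
        counts[i] += 1
    fractions = [c / len(ground_truth) for c in counts]
    return loss, fractions


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # M = 2 predicted points
ground_truth = [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0), (1.2, 0.0, 0.0), (1.0, 0.1, 0.0)]
loss, fractions = completion_targets(predicted, ground_truth)
print(loss, fractions)
```

The fractions sum to one by construction, which is what allows them to be interpreted as a probability distribution over the $M$ predicted points.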

The full system is trained in two stages: first, the viewpoint and depth networks are optimized jointly; then, the point cloud completion network is trained. Qualitative results of the approach can be seen in Figure 3. Note that at test time, the system can predict the shape (as a point cloud) and the viewpoint from a single image.

Figure 2: The point cloud completion network.

Figure 3: Qualitative results. Top row: input images; bottom rows: completed point cloud and predicted viewpoint seen from two different sides.

[14] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proc. CVPR, 2015.

[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

[24] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.

[28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. CoRR, abs/1612.00593, 2016.

