COMPUTERVISION RESEARCHSCIENTIST

RESEARCHSCIENTIST

09^{th}OCTOBER2017

Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. *PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation*. CoRR abs/1612.00593, 2016.

What is **your opinion** on the summarized work? Or do you know related work that is of interest? **Let me know** your thoughts in the comments below or get in touch with me:

Qi et al. introduce PointNet, a network architecture able to learn directly from unordered point clouds. The architecture is based on several key ideas and they provide experiments tackling 3D object classification and 3d segmentation.

Figure 1: Proposed network architecture, see the text for details.

The high level architecture is shown in Figure 1 where both a classification and a segmentation network is shown (which share layers to some extend). The first idea is that a general function operating on a point set can be approximated by transforming the points individually and then applying a symmetric function on the transformed points:

$f(\{x_1,\ldots,x_n\}) \approx g(h(x_1),\ldots,h(x_n))$

Here, $f$ denotes the function to be approximated, $h$ is a point wise transformation and $g$ a symmetric function. In their case, $h$ is computed using a simple multi-layer perceptron (see Figure 1, referred to as "mlp") and $g$ is the max pooling operation. For segmentation, the second idea is to fuse both local information as well as global information. To this end, as illustrated in Figure 1, the input to the segmentation network consists of the set of transformed point features (in this case $n \times 64$ features) concatenated by the global feature extracted by the classification network using the max pooling operation. Finally, they jointly learn an alignment network on the input points as well as on the feature points ($n \times 3$ input and $n \times 64$ features in Figure 1). Therefore, they learn an affine transformation using a mini-network and directly apply this transformation to the coordinates or features of the points. To make training feasible (especially for the transformation network on the $n \times 64$ features) a regularizer is introduced:

$L_{\text{reg}} = \|I - AA^T\|_F^2$

where the transformation matrix is regularized to be close to orthogonal.

In a theoretical section, they show that the proposed network can act as a universal approximator (as known from general multi-layer perceptrons). To this end, they show that for each continuous function $f$ and for all errors $\epsilon$ there exist a function $h$ and a symmetric function $g = \gamma \circ \max$ such that

$\left| f(S) - \gamma(\max\{h(x_i)\})\right| < \epsilon$

In experiments they show that the proposed classification network demonstrates state-of-the-art performance in 3D object classification. They further show promising results in 3D semantic segmentation and also discuss several visualizations of the learned network. For example, Figure 2 shows points where the per point function $h$ (i.e. the computed values in the $1024$ dimensional per-point features in Figure 1) exceeds $0.5$; $15$ random per-point functions $h$ are displayed.

Figure 2. Visualization of $15$ random per point functions and all points where the function exceeds $0.5$.