# DAVIDSTUTZ

Check out the latest superpixel benchmark — Superpixel Benchmark (2016) — and let me know your opinion! @david_stutz
02ndMARCH2017

Yangyan Li, Sören Pirk, Hao Su, Charles Ruizhongtai Qi, Leonidas J. Guibas. FPNN: Field Probing Neural Networks for 3D Data. CoRR, abs/1605.06240.

Li et al. propose a novel, general framework for feeding (sparse) 3D data into deep neural networks. On the problem of 3D object classification (specifically the ModelNet40 dataset [1]), they test their proposed Field Probing Neural Networks (FPNN). The underlying idea is simple. Instead of densely convolving the 3D volume with filters (that are to be learned), they make a separate layer to learn the positions in the 3D data to use as data source. In particular, they propose field probing layers — these consist of three sub-layers: sensor layer, dot product layer and Gaussian layer.

The sensor layer has as parameters the weights and positions to "probe" for information: $\{(x_{c,n},y_{c,n},z_{c,n})\}$ where $c$ indexes one of $C$ filters and $n$ one of $N$ probing points. The input to the sensor layer is a 3D array with $T$ channels. Li et al. propose to use a distance transform as input (i.e. a 3D array containing the distance to the nearest surface at each location). This can be computed efficiently using occupancy grids computed from the given 3D data. The output of the sensor layer is a stack of $C\cdot N\cdot T$ values. The gradients for the backward pass, which are able to move the probing points, are computed using numerical differentiation given the 3D array.

The dot product layer computes the dot product of the given stack of $C\cdot N\cdot T$ values with as many weights. Note that this does not imply weight sharing. In this sense, there is a major difference between convolutional neural networks and field probing neural networks. As reason, Li et al. argue that the "information is not evenly distributed in 3D space". Still it would be interesting to see the influence (advantages/disadvantages) of weight sharing in this case.

The Gaussian layer applies a Gaussian transform (in general, a negative exponential) on the output of the sensor layer. The reasoning behind this layer is that probing points far away from the nearest surface corresponds to larger values in the distance transform used as input. Using the Gaussian transform, the impact of these large values on learning is reduced. Li et al. decided to use a separate layer. The backward pass is easily calculated from the derivative of the Gaussian transform. It is unclear why the input distance transforms were not pre-processed using a Gaussian transform to avoid the computational overhead of forward and backward pass during training.

The overall, general architecture is shown in Figure 1.

The model was trained using stochastic gradient descent on resolution $64 \times 64 \times 64$. Furthermore, while the work was motivated by the low resolution used by other authors, it is unclear why larger resolutions were not evaluated. The presented results are slightly worse than the approach in [2] using multi-view convolutional neural networks. Unfortunately, Li et al. do not discuss possible reasons for this discrepancy.

The implementation is part of Caffe and can be found on GitHub: yanganli/FPNN.

• [1]Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao. 3D shapenets: A deep representation for volumetric shapes. CVPR, 2015.
• [2] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, L. Guibas. Volumetric and multi-view cnns for object classification on 3D data. CVPR, 2016.

What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below or get in touch with me: