Huang and You use simple 3D convolutional networks for point cloud labeling. Given a large point cloud, e.g., one covering part of Ottawa, they extract local point clouds by moving a center point through the cloud and cropping a cubic bounding box of fixed radius around it. Each extracted point cloud is converted to a voxelized occupancy grid, which serves as the network input. The labels are inferred using a voting scheme for each voxel (as points with different labels can fall into the same voxel). They report using $8000$ cells as input, which would correspond to a $20 \times 20 \times 20$ grid. This is indeed rather small; they justify it by noting that 3D convolutional networks quickly reach the memory limit.

The network itself is rather simple and is meant to perform per-voxel semantic segmentation. Motivated by LeNet, the network consists of two 3D convolutional layers (where the convolutional layer is extended to 3D in a straightforward way) and two 3D pooling layers, followed by a fully connected layer. This is illustrated in Figure 1. They present qualitative results in Figure 2.
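
The voxelization and voting step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `voxelize`, its parameters, the sparse representation of the grid, and the use of a plain majority vote per voxel are all assumptions made for the example. With `n=20`, the grid has $20^3 = 8000$ cells, matching the input size mentioned in the summary.

```python
from collections import Counter, defaultdict


def voxelize(points, labels, center, radius, n=20):
    """Hypothetical helper: crop a cubic box around `center` and voxelize it.

    Returns (occupancy, voxel_labels):
      occupancy    -- set of occupied (i, j, k) voxel indices in an n x n x n grid
      voxel_labels -- dict mapping each occupied index to its majority label
    """
    size = 2.0 * radius / n            # edge length of one voxel
    votes = defaultdict(Counter)       # per-voxel label counts

    for point, label in zip(points, labels):
        # Index of the voxel containing the point, relative to the box corner.
        idx = tuple(int((x - (c - radius)) // size) for x, c in zip(point, center))
        # Points outside the cubic bounding box are skipped.
        if all(0 <= i < n for i in idx):
            votes[idx][label] += 1

    occupancy = set(votes)
    # Assumption: simple majority vote; tie-breaking is left to Counter.
    voxel_labels = {idx: counts.most_common(1)[0][0] for idx, counts in votes.items()}
    return occupancy, voxel_labels


# Toy example with made-up points and labels.
points = [(0.5, 0.5, 0.5), (0.2, 0.7, 0.9), (0.6, 0.1, 0.3),
          (-9.5, -9.5, -9.5), (15.0, 0.0, 0.0)]
labels = ["building", "building", "tree", "ground", "tree"]
occupancy, voxel_labels = voxelize(points, labels, center=(0.0, 0.0, 0.0), radius=10.0)
```

In this toy example, the first three points fall into the same voxel, which receives the majority label, and the last point lies outside the box and is ignored. A dense $n \times n \times n$ array could be filled from `occupancy` before passing it to a network; the sparse set is used here only to keep the sketch free of external dependencies.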