Chen et al. propose a multi-view convolutional neural network for 3D object detection, called Multi-View 3D Object Detection Network (MV3D). While the task is to predict 3D bounding boxes, the convolutional neural network operates on 2D projections of the LiDAR information. This allows the network to fuse LiDAR information and RGB information in a “deep fashion”.
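To make the idea of a 2D projection concrete, here is a minimal sketch of one possible top-down projection: each LiDAR point is dropped into a grid cell on the ground plane, and the cell stores the maximum height seen. The function name `birds_eye_height_map`, its parameters, and the choice of a single height channel are assumptions made for illustration; they are not taken from the paper, which uses its own hand-crafted channels.

```python
import math


def birds_eye_height_map(points, x_range, y_range, cell_size):
    """Project (x, y, z) points onto a top-down grid of maximum heights.

    Hypothetical helper for illustration only. Cells that receive no
    point keep the value None.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    n_rows = math.ceil((x_max - x_min) / cell_size)
    n_cols = math.ceil((y_max - y_min) / cell_size)
    grid = [[None] * n_cols for _ in range(n_rows)]

    for x, y, z in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # point lies outside the grid
        # Clamp the indices to guard against floating-point rounding at the border.
        row = min(int((x - x_min) / cell_size), n_rows - 1)
        col = min(int((y - y_min) / cell_size), n_cols - 1)
        if grid[row][col] is None or z > grid[row][col]:
            grid[row][col] = z  # keep the highest point per cell
    return grid


# Toy point cloud: two points share a cell, one falls outside the grid.
points = [(0.5, 0.5, 1.0), (0.6, 0.4, 2.0), (1.5, 0.5, -0.5), (5.0, 5.0, 9.0)]
height_map = birds_eye_height_map(points, (0.0, 2.0), (0.0, 1.0), 1.0)
```

The result is an ordinary 2D array, which is what allows a standard 2D convolutional network to consume LiDAR data alongside an RGB image.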
Figure 1: Network architecture comprising the proposal network and the fusion network as well as illustrating the used input data.
The given LiDAR data is projected in two ways: first into a bird’s eye view and second into a front view. In both cases, the channels are hand-crafted. Together with the RGB image, all inputs are therefore purely 2D. Chen et al. then discuss a 3D Proposal Network that operates purely on the bird’s eye view and a region-based Fusion Network. Both are illustrated in Figure 1. The general structure follows related work (e.g. R-CNN): the proposal network predicts objectness and bounding boxes, and Region-of-Interest Pooling is then used to apply the classifier on top. The key contribution is the fusion network. Instead of fusing the different inputs (i.e. LiDAR and RGB) before the network (early fusion) or at the end of the network (late fusion), they propose to fuse the information at every step of the network using element-wise mean operations (see Figure 1).
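The difference between the three fusion schemes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: feature maps are plain lists of floats, a "layer" is any function from a list to a list, and the names `elementwise_mean`, `early_fusion`, `late_fusion`, `deep_fusion` and `scale` are hypothetical.

```python
def elementwise_mean(vectors):
    """Average several equal-length feature vectors element by element."""
    return [sum(values) / len(values) for values in zip(*vectors)]


def early_fusion(view_features, layers):
    """Combine the views once, then run a single stack of layers."""
    fused = elementwise_mean(view_features)
    for layer in layers:
        fused = layer(fused)
    return fused


def late_fusion(view_features, branches):
    """Run one independent stack per view and combine only at the end."""
    outputs = []
    for features, branch in zip(view_features, branches):
        for layer in branch:
            features = layer(features)
        outputs.append(features)
    return elementwise_mean(outputs)


def deep_fusion(view_features, branches):
    """Combine the views, then re-combine the branch outputs after every layer."""
    fused = elementwise_mean(view_features)
    for layers_at_depth in zip(*branches):
        fused = elementwise_mean([layer(fused) for layer in layers_at_depth])
    return fused


def scale(factor):
    """Toy stand-in for a learned layer: multiply every feature by a constant."""
    return lambda features: [factor * value for value in features]


# Three views (e.g. bird's eye view, front view, RGB) with two features each.
views = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# One two-layer branch per view.
branches = [[scale(1), scale(1)], [scale(2), scale(2)], [scale(3), scale(3)]]

deep_result = deep_fusion(views, branches)
late_result = late_fusion(views, branches)
early_result = early_fusion(views, [scale(2), scale(2)])
```

In the deep variant, every branch sees the averaged output of all branches from the previous step, so information from the different views is mixed repeatedly rather than only once at the input or the output.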
Figure 2: Qualitative results comparing 3DOP (left), VeloFCN (middle) and the proposed approach (right).
The presented experiments are based on VGG16 and the KITTI dataset. Qualitative results are shown in Figure 2. In an ablation study, they demonstrate the advantage of their fusion model and of additionally using auxiliary losses as regularization.