End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds

Abstract

Recent work on 3D object detection advocates point cloud voxelization in bird's-eye view, where objects preserve their physical dimensions and are naturally separable. When represented in this view, however, point clouds are sparse and have highly variable point density, which can make it difficult for detectors to find distant or small objects (pedestrians, traffic signs, etc.). The perspective view, on the other hand, provides dense observations, which could allow more favorable feature encoding for such cases. In this paper, we aim to synergize the bird's-eye view and the perspective view and propose a novel end-to-end multi-view fusion (MVF) algorithm, which can effectively learn to utilize the complementary information from both. Specifically, we introduce dynamic voxelization, which has four advantages over existing voxelization methods: i) it removes the need to pre-allocate a tensor of fixed size; ii) it avoids the information loss caused by stochastic point/voxel dropout; iii) it yields deterministic voxel embeddings and more stable detection outcomes; iv) it establishes a bi-directional relationship between points and voxels, which potentially lays a natural foundation for cross-view feature fusion. By employing dynamic voxelization, the proposed feature fusion architecture enables each point to learn to fuse context information from different views. MVF operates on points and can be naturally extended to other approaches using LiDAR point clouds. We evaluate our MVF model extensively on the newly released Waymo Open Dataset and on the KITTI dataset and demonstrate that it significantly improves detection accuracy over the comparable single-view PointPillars baseline.
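
To make the idea of dynamic voxelization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each point is assigned to a voxel by flooring its coordinates on a regular grid, and it uses mean pooling purely as a simple stand-in for a learned aggregation. The function names (`dynamic_voxelize`, `pool_to_voxels`, `gather_to_points`) and the parameters `voxel_size` and `origin` are hypothetical and chosen for illustration only.

```python
import numpy as np


def dynamic_voxelize(points, voxel_size, origin):
    """Map every point to a voxel without a pre-allocated fixed-size buffer.

    Returns the coordinates of the occupied voxels and, for each point,
    the row index of its voxel (the point-to-voxel mapping).
    """
    # Integer grid coordinates of each point, shape (N, D).
    grid = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Keep only occupied voxels; no point or voxel is dropped.
    voxels, point_to_voxel = np.unique(grid, axis=0, return_inverse=True)
    # Flatten in case the inverse comes back with a trailing axis.
    return voxels, point_to_voxel.reshape(-1)


def pool_to_voxels(features, point_to_voxel, num_voxels):
    """Aggregate per-point features into per-voxel features (mean pooling,
    an assumption made here for simplicity)."""
    sums = np.zeros((num_voxels, features.shape[1]), dtype=np.float64)
    np.add.at(sums, point_to_voxel, features)  # unbuffered scatter-add
    counts = np.bincount(point_to_voxel, minlength=num_voxels)
    return sums / counts[:, None]


def gather_to_points(voxel_features, point_to_voxel):
    """Voxel-to-point direction: give each point its voxel's feature."""
    return voxel_features[point_to_voxel]
```

Because the point-to-voxel index is kept, the same points can be voxelized once per view, and the gathered per-view features can then be combined point by point (for example by concatenation), which is the sense in which the mapping is bi-directional. The result is also deterministic, since every point contributes and nothing is sampled.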