VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

TLDR

Accurate 3D object detection in point clouds is critical for autonomous navigation, robotics, and AR, yet most methods rely on hand-crafted features such as bird's-eye-view projections. This work introduces VoxelNet, an end-to-end deep network that eliminates manual feature engineering by jointly extracting features and predicting 3D bounding boxes. VoxelNet voxelizes the point cloud, applies a voxel feature encoding (VFE) layer to aggregate the points within each voxel into a unified representation, and feeds the resulting volumetric features into a region proposal network (RPN) for detection. On the KITTI car detection benchmark, VoxelNet outperforms prior LiDAR-based 3D detectors by a large margin, and it achieves encouraging results for pedestrians and cyclists using only LiDAR data.

Abstract

Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's-eye-view projection. In this work, we remove the need for manual feature engineering on 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms the group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists based on LiDAR alone.
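
The voxel grouping and per-voxel encoding described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the internals of the VFE layer, so the sketch assumes a single point-wise linear layer with ReLU followed by element-wise max pooling over the points of each voxel, with random (untrained) weights. The function names `voxelize` and `vfe_like`, the voxel size, and the toy points are all hypothetical choices made for illustration.

```python
import numpy as np


def voxelize(points, voxel_size, range_min):
    """Assign each point (x, y, z, ...) to an equally spaced 3D voxel.

    Returns the integer coordinates of the non-empty voxels and, for every
    point, the index of the voxel it falls into.
    """
    coords = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    return voxels, inverse.reshape(-1)


def vfe_like(points, inverse, num_voxels, weight, bias):
    """Simplified VFE-style encoding (an assumption, not the paper's exact layer)."""
    # Per-voxel centroid of the point coordinates.
    sums = np.zeros((num_voxels, 3))
    np.add.at(sums, inverse, points[:, :3])
    counts = np.bincount(inverse, minlength=num_voxels)[:, None]
    centroids = sums / counts

    # Augment each point with its offset from the centroid of its voxel.
    augmented = np.concatenate(
        [points, points[:, :3] - centroids[inverse]], axis=1
    )

    # Point-wise linear layer followed by ReLU.
    pointwise = np.maximum(augmented @ weight + bias, 0.0)

    # Element-wise max pooling over the points in each voxel.
    pooled = np.full((num_voxels, weight.shape[1]), -np.inf)
    np.maximum.at(pooled, inverse, pointwise)
    return pooled


# Toy input: three points with (x, y, z, reflectance); values are made up.
points = np.array(
    [
        [0.1, 0.1, 0.1, 0.5],
        [0.2, 0.3, 0.1, 0.7],
        [1.5, 0.2, 0.3, 0.9],
    ]
)
voxel_size = np.array([1.0, 1.0, 1.0])  # hypothetical voxel size
range_min = np.zeros(3)                 # hypothetical lower bound of the grid

voxels, inverse = voxelize(points, voxel_size, range_min)

rng = np.random.default_rng(0)
weight = rng.normal(size=(7, 8))        # 4 input dims + 3 offsets -> 8 features
bias = np.zeros(8)
features = vfe_like(points, inverse, len(voxels), weight, bias)

print(voxels)          # integer coordinates of the non-empty voxels
print(features.shape)  # one feature vector per non-empty voxel
```

Only non-empty voxels are stored, which suits the sparsity of LiDAR data noted in the abstract. In a full detector, the per-voxel features would be scattered into a dense grid and passed on to the RPN; that stage is omitted here.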
