TSF: Two-Stage Sequential Fusion for 3D Object Detection
Three-dimensional (3D) object detection is essential for an effective perception system in autonomous driving. Current methods primarily rely on cameras and LiDAR to understand the environment. Camera images provide rich color and texture features but lack depth information and are sensitive to lighting changes. LiDAR-only detectors such as PointRCNN, in turn, may face challenges with columnar structures and can miss cars in sparse point clouds, particularly distant vehicles.
Multimodal fusion methods enhance 3D detection accuracy by integrating 2D image semantics to compensate for the limitations of point cloud data. Recent work explores several fusion strategies, including feature fusion, decision fusion, and sequential fusion. Sequential fusion combines the two: decisions are first obtained from one sensor, fused as features with the data from another sensor, and the detection is then completed. Despite its efficiency, existing sequential fusion does not fully integrate image detection results and does not fully exploit the potential of 3D detectors.
The paper presents TSF, a two-stage sequential fusion method for 3D object detection that efficiently fuses camera results with LiDAR information. In the first stage, the Nearest Group Painting (NGP) method, built on PointPainting, fuses segmentation labels, pixel scores, and raw point clouds, enhancing the overall point cloud quality. PointPainting attaches semantic segmentation output from a 2D network to the point cloud, while NGP refines these semantics using image instance segmentation, improving the proposals later generated by PointRCNN. The resulting painted point cloud can be processed by any 3D detector, so TSF offers the multimodal feature interaction of feature fusion while retaining the flexibility and modularity of decision fusion.
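To make the painting step concrete, the following is a minimal sketch of PointPainting-style decoration: LiDAR points are projected into the image with a calibration matrix and per-pixel segmentation scores are appended to each point. The projection matrix `P`, the score map `seg_scores`, and the function name are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of PointPainting-style point decoration (illustrative only).
import numpy as np

def paint_points(points_xyz, seg_scores, P):
    """Append image segmentation scores to each LiDAR point.

    points_xyz : (N, 3) LiDAR points in the LiDAR frame.
    seg_scores : (H, W, C) per-pixel class scores from a 2D segmenter.
    P          : (3, 4) projection matrix from the LiDAR frame to image pixels.
    Returns (N, 3 + C) painted points; points projecting outside the image
    keep zero scores.
    """
    n, (h, w, c) = points_xyz.shape[0], seg_scores.shape
    homo = np.hstack([points_xyz, np.ones((n, 1))])        # homogeneous coords, (N, 4)
    uvw = homo @ P.T                                       # project to image plane, (N, 3)
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)     # normalize to pixel coordinates
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    painted = np.zeros((n, c), dtype=seg_scores.dtype)
    painted[valid] = seg_scores[v[valid], u[valid]]        # look up per-pixel class scores
    return np.hstack([points_xyz, painted])
```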
In the second stage, PointRCNN, a two-stage 3D object detection algorithm, is employed. Its first stage assigns a confidence score to each point and regresses region proposals, using the confidence as the foreground probability and the proposals as object boundaries. To handle the large number of low-confidence background proposals, PointRCNN applies confidence-based non-maximum suppression (C-NMS), ranking proposals by confidence and filtering out the low-confidence ones. However, this can discard low-confidence true positives while retaining high-confidence false positives. As a remedy, TSF uses confidence-distance NMS (C-D NMS): the distance-based component computes Euclidean distances between proposals, sorts them accordingly, and significantly reduces the number of proposals fed into PointRCNN's second stage for object detection.
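A hedged sketch of the distance-based filtering is shown below: proposals are ranked by confidence, and lower-ranked proposals whose centers fall within a Euclidean distance threshold of an already-kept proposal are suppressed. The threshold, the `max_keep` cap, and the exact suppression rule are assumptions for illustration; the paper's C-D NMS formulation may differ in detail.

```python
# Illustrative confidence-and-distance NMS over proposal box centers.
import numpy as np

def confidence_distance_nms(centers, scores, dist_thresh=1.0, max_keep=100):
    """centers: (N, 3) proposal box centers; scores: (N,) confidences.
    Returns the indices of the kept proposals."""
    order = np.argsort(-scores)                 # highest confidence first
    keep = []
    for idx in order:
        if len(keep) >= max_keep:               # cap the number of surviving proposals
            break
        if keep:
            d = np.linalg.norm(centers[keep] - centers[idx], axis=1)
            if np.any(d < dist_thresh):         # too close to an already-kept proposal
                continue
        keep.append(idx)
    return np.array(keep, dtype=int)
```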
The fusion model was trained and evaluated on the KITTI dataset, which provides LiDAR point clouds, front-view RGB images, and calibration data. Objects are categorized into three difficulty classes based on object size, visibility, and truncation. The NGP module's threshold radius was set to 4.5 m for cars, 2 m for cyclists, and 1 m for pedestrians.
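As an illustration of how these class-specific radii might be applied, the sketch below keeps raw points lying within the class radius of a painted instance's centroid. Only the radius values come from the paper; the grouping rule, dictionary, and function name are illustrative assumptions rather than NGP's exact procedure.

```python
# Hypothetical class-radius grouping using the thresholds reported in the paper.
import numpy as np

NGP_RADIUS = {"Car": 4.5, "Cyclist": 2.0, "Pedestrian": 1.0}  # meters

def group_by_radius(points_xyz, instance_points, cls):
    """Return a boolean mask over `points_xyz` selecting points within the
    class-specific radius of the painted instance's centroid."""
    centroid = instance_points.mean(axis=0)
    dists = np.linalg.norm(points_xyz - centroid, axis=1)
    return dists <= NGP_RADIUS[cls]
```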
A comparison of NMS strategies shows that the proposals generated by the NGP-based PointRCNN fit objects more tightly, and that C-D NMS largely resolves the problems posed by C-NMS. Incorporating both NGP and C-D NMS into the full network yields 83.60% 3D average precision, surpassing PointRCNN and other state-of-the-art methods on the KITTI 3D detection benchmark. The proposed TSF module also correctly detects occluded pedestrians and corrects misidentified vehicles.