Abstract


Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised mask modeling pre-training strategy that improves multi-modal representation learning and data efficiency through three novel objectives. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate the efficacy of ProFusion3D. Moreover, ProFusion3D is robust to sensor failure, showing strong performance when only one modality is available.

Technical Approach

Figure: (a) Illustration of our proposed ProFusion3D architecture that employs progressive fusion, and (b) the topology of our proposed fusion module.

Our proposed ProFusion3D architecture tackles the challenge of fusing LiDAR and camera inputs for accurate 3D object detection using a novel progressive fusion strategy. The framework first processes the LiDAR point cloud and the multi-view camera images with separate encoders, yielding Bird's Eye View (BEV) features for LiDAR \(F_{\text{l\_bev}}\) and Perspective View (PV) features for the camera \(F_{\text{c\_pv}}\). To ensure comprehensive spatial understanding, we perform cross-view mapping between these feature spaces, transforming the LiDAR BEV features into PV \(F_{\text{l\_pv}}\) and the camera PV features into BEV \(F_{\text{c\_bev}}\). This cross-view transformation enables the model to utilize both the fine-grained spatial details of PV and the structured geometric information of BEV, addressing a limitation of previous single-view fusion methods.
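As a rough illustration, the sketch below shows one simplified direction of such a cross-view mapping: lifting camera PV features onto a BEV grid by projecting the BEV cell centers, placed at a fixed reference height, into the image and bilinearly sampling the PV feature map. This is a minimal sketch under our own assumptions (the `lidar2img` projection matrix, the grid extents, and the reference height are illustrative), not the exact mapping used in ProFusion3D.

```python
# Minimal sketch of PV -> BEV cross-view mapping (illustrative assumptions only).
import torch
import torch.nn.functional as F


def pv_to_bev(pv_feats, lidar2img, bev_size=(180, 180),
              bev_range=(-54.0, 54.0), ref_height=0.0):
    """Sample PV features onto a BEV grid.

    pv_feats:  (C, H, W) camera feature map in perspective view.
    lidar2img: (4, 4) projection matrix from the LiDAR/ego frame to image pixels.
    """
    C, H, W = pv_feats.shape
    nx, ny = bev_size
    lo, hi = bev_range

    # 3D centers of all BEV cells at a fixed reference height (homogeneous coords).
    xs = torch.linspace(lo, hi, nx)
    ys = torch.linspace(lo, hi, ny)
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([x, y, torch.full_like(x, ref_height),
                       torch.ones_like(x)], dim=-1)            # (ny, nx, 4)

    # Project the cell centers into the image plane.
    cam = pts.reshape(-1, 4) @ lidar2img.T                     # (ny*nx, 4)
    eps = 1e-5
    depth = cam[:, 2].clamp(min=eps)
    u = cam[:, 0] / depth
    v = cam[:, 1] / depth

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)
    grid = grid.reshape(1, ny, nx, 2)

    # Cells behind the camera or outside the image receive zeros.
    valid = (cam[:, 2] > eps).reshape(1, 1, ny, nx).float()
    bev = F.grid_sample(pv_feats[None], grid, align_corners=True,
                        padding_mode="zeros")                  # (1, C, ny, nx)
    return (bev * valid).squeeze(0)
```

The reverse direction, producing \(F_{\text{l\_pv}}\), can analogously project the LiDAR BEV features (or the points they originate from) into the image plane and scatter them onto the PV grid.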


Following the cross-view mapping, our architecture employs an inter-intra fusion module that hierarchically integrates intra-modality and inter-modality features in each view space. Specifically, the native-view features of each modality (BEV LiDAR \(F_{\text{l\_bev}}\), PV camera \(F_{\text{c\_pv}}\)) are combined with the cross-view-projected features of the other modality (BEV camera \(F_{\text{c\_bev}}\), PV LiDAR \(F_{\text{l\_pv}}\)) to capture both local and global spatial relationships. The fused BEV \(F_{\text{bev}}\) and PV \(F_{\text{pv}}\) representations are then processed by the decoder through an object query-level fusion strategy that allows view-specific feature refinement while maintaining a holistic understanding of the scene.
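A minimal PyTorch sketch of what such an inter-intra fusion block could look like is shown below; the layer choices (a 3x3 convolution for local fusion and multi-head self-attention over flattened tokens for global fusion) are our own assumptions for illustration, not the released module. In this sketch, the BEV branch would fuse \(F_{\text{l\_bev}}\) with \(F_{\text{c\_bev}}\) and the PV branch \(F_{\text{c\_pv}}\) with \(F_{\text{l\_pv}}\), e.g., `F_bev = InterIntraFusion(256)(F_l_bev, F_c_bev)`.

```python
# Minimal sketch of an inter-intra fusion block (illustrative assumptions only).
import torch
import torch.nn as nn


class InterIntraFusion(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # Local fusion: convolution over the concatenated intra/inter features.
        self.local = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global fusion: multi-head self-attention over all spatial tokens.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.out = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, intra, inter):
        """intra, inter: (B, C, H, W) features in the same view space."""
        B, C, H, W = intra.shape
        fused_local = self.local(torch.cat([intra, inter], dim=1))

        # Flatten to (B, H*W, C) tokens, attend globally, and reshape back.
        tokens = fused_local.flatten(2).transpose(1, 2)
        attended, _ = self.attn(tokens, tokens, tokens)
        fused_global = self.norm(attended + tokens)
        fused_global = fused_global.transpose(1, 2).reshape(B, C, H, W)

        # Combine the local and global branches.
        return self.out(torch.cat([fused_local, fused_global], dim=1))
```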


To further enhance the robustness and data efficiency of ProFusion3D, we introduce a multi-modal mask modeling pre-training strategy, which leverages asymmetrical masking and three distinct objectives: (i) masked token reconstruction to emphasize spatial feature learning, (ii) unmasked token denoising to capture fine-grained details, and (iii) cross-modal attribute prediction to utilize complementary information between modalities. During pre-training, the model learns to predict masked tokens from noisy or partial inputs, thereby improving its ability to fuse LiDAR and camera data effectively during downstream 3D object detection tasks.
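For illustration, the total pre-training loss can be thought of as a weighted sum of the three objectives, as in the sketch below; the loss type (L1), the head outputs, and the weights are assumptions rather than the paper's exact choices.

```python
# Minimal sketch of the combined pre-training objective (illustrative assumptions only).
import torch.nn.functional as F


def pretraining_loss(pred_masked, target_masked,
                     pred_denoised, target_clean,
                     pred_attr, target_attr,
                     w_rec=1.0, w_den=1.0, w_attr=1.0):
    """Weighted sum of the three mask-modeling objectives.

    pred_masked / target_masked:   reconstructions and targets for masked tokens.
    pred_denoised / target_clean:  denoised predictions and clean targets for the
                                   unmasked (noised) tokens.
    pred_attr / target_attr:       attributes of one modality predicted from the
                                   other modality's tokens (cross-modal prediction).
    """
    loss_rec = F.l1_loss(pred_masked, target_masked)      # (i) masked token reconstruction
    loss_den = F.l1_loss(pred_denoised, target_clean)     # (ii) unmasked token denoising
    loss_attr = F.l1_loss(pred_attr, target_attr)         # (iii) cross-modal attribute prediction
    return w_rec * loss_rec + w_den * loss_den + w_attr * loss_attr
```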


Overall, ProFusion3D’s progressive fusion strategy, coupled with robust pre-training, enables superior performance on 3D detection benchmarks like nuScenes and Argoverse2, while ensuring robustness even under challenging conditions such as single-modality failure.

Teaser Video

Code

For academic usage, a PyTorch implementation of this project can be found in our GitHub repository and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Rohit Mohan, Daniele Cattaneo, Florian Drews, Abhinav Valada

Progressive Multi-Modal Fusion for Robust 3D Object Detection
Conference on Robot Learning, 2024.
(PDF) (BibTeX)

Authors

Rohit Mohan

University of Freiburg

Daniele Cattaneo

University of Freiburg

Abhinav Valada

University of Freiburg

Acknowledgment

This research was funded by Bosch Research as part of a collaboration between Bosch Research and the University of Freiburg on AI-based automated driving.