Our proposed ProFusion3D architecture tackles the challenge of fusing LiDAR and camera inputs
for accurate 3D object detection using a novel progressive fusion strategy. The framework first
processes the LiDAR point clouds and multi-view camera images with separate encoders, resulting
in Bird’s Eye View (BEV) features for LiDAR, \(F_{\text{L}}^{\text{BEV}}\), and Perspective View (PV)
features for the camera, \(F_{\text{C}}^{\text{PV}}\). To ensure comprehensive spatial understanding, we
perform cross-view mapping between these feature spaces, transforming the LiDAR BEV features into the
perspective view, \(F_{\text{L}}^{\text{PV}}\), and the camera PV features into the bird's eye view,
\(F_{\text{C}}^{\text{BEV}}\). This cross-view transformation
enables the model to utilize both the fine-grained spatial details of the PV and the
structured geometric information of the BEV, addressing the limitations of previous single-view
fusion methods.
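To make the cross-view mapping concrete, below is a minimal PyTorch sketch of the resampling step, assuming a precomputed correspondence grid between BEV cells and image pixels derived from the sensor calibration; the function names and the grid-sample-based gather are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bev_to_pv(f_l_bev: torch.Tensor, pv_grid: torch.Tensor) -> torch.Tensor:
    """Resample LiDAR BEV features into the perspective view (F_L^PV).

    f_l_bev: (B, C, H_bev, W_bev) LiDAR BEV features.
    pv_grid: (B, H_pv, W_pv, 2) normalized BEV (x, y) coordinates in
             [-1, 1] for every PV pixel, precomputed from calibration.
    """
    # Bilinearly gather the BEV feature that lies under each PV pixel's ray.
    return F.grid_sample(f_l_bev, pv_grid, align_corners=False)

def pv_to_bev(f_c_pv: torch.Tensor, bev_grid: torch.Tensor) -> torch.Tensor:
    """Resample camera PV features onto the BEV grid (F_C^BEV)."""
    return F.grid_sample(f_c_pv, bev_grid, align_corners=False)
```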
Following the cross-view mapping, our architecture employs an inter-intra fusion module to
hierarchically integrate the intra-modality and inter-modality features for each view space.
Specifically, the intra-modality features of each view space, i.e., the native LiDAR BEV and camera PV
features, are combined with their inter-modality counterparts (the projected camera BEV and LiDAR PV
features, respectively) to capture both local and global spatial relationships. The fused BEV
\(F^{\text{BEV}}\) and PV \(F^{\text{PV}}\) representations are then processed by the decoder
through an object query-level fusion strategy that allows view-specific feature refinement while
maintaining holistic scene understanding.
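As a rough illustration of one inter-intra fusion block for a single view space (shown for the BEV; the PV branch is symmetric), consider the sketch below. The concatenation-plus-channel-gate design and all module names are assumptions for exposition, not the exact block used in the paper.

```python
import torch
import torch.nn as nn

class InterIntraFusion(nn.Module):
    """Fuses native (intra) and cross-view (inter) features of one view."""

    def __init__(self, channels: int):
        super().__init__()
        # Local mixing of the concatenated intra/inter features.
        self.local = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global channel gate (squeeze-and-excitation style) that weighs
        # each channel by a scene-level statistic.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_intra: torch.Tensor, f_inter: torch.Tensor) -> torch.Tensor:
        # f_intra: native features of this view (e.g. LiDAR BEV).
        # f_inter: cross-view features from the other modality (e.g. camera BEV).
        fused = self.local(torch.cat([f_intra, f_inter], dim=1))
        # The gate captures global context; the residual keeps the intra signal.
        return fused * self.gate(fused) + f_intra
```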
To further enhance the robustness and data efficiency of ProFusion3D, we introduce a multi-modal
masked modeling pre-training strategy, which leverages asymmetric masking and three distinct
objectives: (i) masked token reconstruction to emphasize spatial feature learning, (ii) unmasked
token denoising to capture fine-grained details, and (iii) cross-modal attribute prediction to
utilize complementary information between modalities. During pre-training, the model learns to
reconstruct masked tokens, denoise visible ones, and predict cross-modal attributes from noisy or
partial inputs, thereby improving its ability to fuse LiDAR and camera data effectively on the
downstream 3D object detection task.
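The following sketch shows how the asymmetric masking and the three objectives could be wired together; the masking helper, prediction tensors, and uniform loss weighting are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def asymmetric_mask(batch: int, num_tokens: int, ratio: float,
                    device: torch.device) -> torch.Tensor:
    """Boolean mask (B, N); 'asymmetric' here means each modality is
    masked with its own ratio (e.g. a higher ratio for LiDAR tokens)."""
    return torch.rand(batch, num_tokens, device=device) < ratio

def pretraining_loss(pred_masked, target_masked,      # (i)
                     pred_denoised, target_clean,     # (ii)
                     pred_attr, target_attr):         # (iii)
    # (i) Masked token reconstruction: recover features at masked positions.
    l_rec = F.mse_loss(pred_masked, target_masked)
    # (ii) Unmasked token denoising: regress clean tokens from noised inputs.
    l_den = F.mse_loss(pred_denoised, target_clean)
    # (iii) Cross-modal attribute prediction, e.g. image intensity from
    # LiDAR tokens or point depth from camera tokens.
    l_attr = F.l1_loss(pred_attr, target_attr)
    return l_rec + l_den + l_attr  # uniform weighting is an assumption
```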
Overall, ProFusion3D’s progressive fusion strategy, coupled with robust pre-training, enables
superior performance on 3D detection benchmarks such as nuScenes and Argoverse 2, while ensuring
robustness even under challenging conditions, including single-modality failure.