Open Access
ARTICLE
DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection
Software College, Northeastern University, Shenyang, China
* Corresponding Author: Jie Song. Email:
(This article belongs to the Special Issue: Advanced Image Segmentation and Object Detection: Innovations, Challenges, and Applications)
Computer Modeling in Engineering & Sciences 2026, 147(1), 1 https://doi.org/10.32604/cmes.2026.080595
Received 12 February 2026; Accepted 23 March 2026; Issue published 27 April 2026
Abstract
Knowledge distillation bridges the performance gap between camera-based and LiDAR-based 3D detectors by leveraging the precise geometric information from LiDAR. However, cross-modal knowledge transfer remains challenging due to the inherent modality heterogeneity between LiDAR and camera data, which often leads to instability during training. In this work, we find that these instabilities are closely related to distribution mismatch in the cross-modal feature space and noisy teacher signals. To address this issue, we propose a novel distribution-aware cross-modal distillation framework, named DA-T3D. Specifically, we first explicitly model the LiDAR teacher’s Bird’s-Eye-View (BEV) feature distribution and use the learned distribution as a statistical prior to guide the student features toward high-density and geometrically stable regions in the teacher’s BEV feature space. This ensures feature alignment in BEV space by constraining the student model’s feature distribution to match that of the LiDAR teacher model within foreground regions. Next, we further introduce response-level distillation to directly transfer the teacher’s prediction behavior to the student detection head, providing direct output-space supervision that complements feature distillation and effectively reduces modality-induced ambiguity, leading to more accurate and stable classification confidence and bounding-box regression. Furthermore, we perform temporal modeling on the distilled cross-modal features to produce fused BEV representations that capture more comprehensive scene context. Finally, we utilize the fused BEV features to generate 3D detection results. Through experiments, we validate the effectiveness and superiority of DA-T3D on the nuScenes dataset, achieving 46.7% mAP and 58.1% NDS.
Keywords
1 Introduction
3D object detection based on multi-view cameras is a fundamental yet challenging task in autonomous driving [1]. In real-world applications such as autonomous driving, accurate 3D object detection directly affects a vehicle’s ability to perceive the surrounding environment and make safe driving decisions. However, compared with LiDAR-based methods, camera-only methods often suffer from ambiguous depth estimation and are more sensitive to illumination variations and occlusions, which typically result in degraded 3D localization accuracy and limited robustness. To narrow the performance gap with LiDAR-based methods, researchers have increasingly explored cross-modal knowledge distillation in recent years. Specifically, cross-modal distillation transfers geometric priors from complementary modalities such as LiDAR to a camera-based student, providing reliable 3D structural cues to improve the 3D detection performance [2]. However, the inherent data heterogeneity between LiDAR point clouds and camera images poses challenges for effective cross-modal distillation.
To alleviate the distillation challenges caused by the modality gap between LiDAR and cameras, existing methods typically map data from both modalities into a unified feature space to facilitate feature imitation [3]. Some studies project LiDAR points onto the image plane and perform distillation in 2D space [4]. However, such cross-modal transformations often lead to the loss of intrinsic features of the original data, which limits the student model’s ability to learn effective information from the teacher. Consequently, another mainstream method maps both modalities into a unified BEV space [5], enabling the student model to align features with the teacher more directly, as shown in Fig. 1. These works commonly adopt point-wise aligned distillation, which allows fine-grained matching between BEV features from the two modalities. Nevertheless, background regions in BEV space often contain substantial task-irrelevant noise, which can divert the distillation process toward redundant background features and reduce the efficiency of learning key foreground features. To address this issue, Chen et al. [6] proposed a foreground-aware distillation method that has been widely adopted. By focusing knowledge transfer on foreground target regions in the scene, it enhances the model’s ability to extract and transfer important features.

Figure 1: Cross-modal knowledge distillation frameworks.
Despite the promising progress of existing cross-modal distillation methods, the domain gap across modalities persists due to differences in imaging mechanisms and spatial resolution. In this context, adopting a point-wise aligned distillation scheme that enforces exact consistency between the BEV features of the two modalities may lead to noise amplification and overly restrictive constraints, thereby affecting the model’s detection performance. Moreover, distillation typically depends on high-quality supervisory signals from the teacher model. However, the teacher’s features may themselves contain noise and bias, for example, due to false positives, missed detections, or feature jitter. Such noise can be directly transferred to the student during distillation, leading to unstable supervision and reduced distillation effectiveness. Therefore, cross-modal knowledge distillation faces two core challenges: (1) due to inherent modality heterogeneity, using a simple point-to-point distillation method is suboptimal, and (2) the LiDAR teacher’s features can be noisy, so naive imitation may introduce erroneous supervision.
In this work, we propose a novel distribution-aware cross-modal distillation framework, which is a carefully designed distribution-level cross-modal distillation strategy that effectively addresses the aforementioned challenges. Specifically, our method first models class-conditional feature distributions of the LiDAR teacher’s BEV features. Then, using a distribution-consistency constraint, we encourage the student features to fall into the teacher’s high-density and geometrically stable regions, as shown in Fig. 1c. By aligning features at the distribution level, this method effectively narrows the BEV representation gap between the two modalities. Meanwhile, the modeling process naturally suppresses a small number of outlier and noisy teacher features. Distribution-level distillation pulls the student toward aggregated mode centers rather than individual noisy instances, thereby mitigating the adverse effects of teacher noise. In addition, to reduce interference from factors such as target occlusion and motion blur, we further apply lightweight temporal modeling to the distilled BEV features, improving training stability. The main contributions of this paper are as follows:
1. We propose a novel distribution-aware cross-modal distillation framework (DA-T3D) for 3D object detection, which enables distribution-level knowledge transfer from a LiDAR teacher to a camera-based student. In addition, we introduce response-level distillation to convey task-specific decision knowledge, further improving detection performance.
2. We propose a lightweight temporal fusion module that fuses features from two consecutive frames and introduces a gating mechanism to adaptively balance the contributions of the current and historical frames.
3. Through extensive experiments and ablation studies on the nuScenes benchmark, our framework demonstrates outstanding performance in 3D object detection. Our best model achieves 46.7% mAP and 58.1% NDS on nuScenes.
The remainder of this paper is organized as follows: Section 2 briefly reviews the related work. Section 3 introduces our proposed solutions in detail. Experimental settings and results, along with comparisons to baseline methods, are presented in Section 4 to validate the effectiveness of our approach. Finally, Section 5 presents the conclusion of this paper, summarizing the key contributions and discussing potential future directions.
2 Related Work
2.1 Multi-View 3D Object Detection
Multi-view 3D object detection aims to leverage surround-view camera images to align and fuse multi-view 2D features into a unified 3D space or bird’s-eye-view (BEV) representation, thereby enabling 3D object localization and attribute regression. Existing methods mainly follow two paradigms: (1) explicitly constructing a dense BEV representation and then performing detection; and (2) adopting query-based or sparse 3D representations, where 3D queries directly aggregate information from multi-view features to regress 3D bounding boxes [7].
For explicit BEV construction, early studies achieved view transformation and feature fusion by predicting pixel-wise depth distributions (e.g., LSS [8]). Subsequent works have improved this pipeline along several directions, including depth estimation quality and temporal fusion. For example, BEVDepth introduces depth supervision [9], BEVFormer generates BEV features with spatiotemporal attention [10], and GeoBEV enhances geometric details via more efficient BEV sampling and structure-aware depth supervision [11]. In contrast, to avoid the computational overhead of dense BEV, query-based methods use 3D queries to interact with multi-view features. DETR3D samples features by projecting 3D reference points onto 2D views [12]. PETR and its variants strengthen spatial alignment with 3D positional embeddings [13], and Sparse4D aggregates multi-view and temporal information using 4D keypoints [14]. These methods have continually evolved to balance efficiency and accuracy, collectively advancing vision-only 3D detection. However, the performance of multi-view models heavily depends on the quality of depth estimation, lacks robustness to complex conditions such as illumination changes and adverse weather, and typically requires large amounts of accurately annotated data for supervised learning [9].
2.2 Multi-Modal 3D Object Detection
Multi-modal 3D object detection aims to fuse semantic and geometric information from sensors such as cameras, LiDAR, and radar to improve perception performance in complex scenarios. Existing methods can be categorized by fusion stage as early fusion, feature-level fusion, and late fusion. Mainstream directions include BEV-based unified representations, sparse query–based fusion, and unified 3D representations, enabling better cross-modal complementarity [3,15].
Specifically, early fusion injects image semantics directly into point clouds or voxels, as in PointPainting [16] and MVX-Net [17]. However, it is sensitive to calibration errors and point cloud sparsity. Subsequent works such as PPF-Net improve robustness via region-level semantic aggregation [18]. Feature-level fusion maps multi-modal features into a shared BEV space for interaction, with BEVFusion providing a lightweight fusion framework [19,20]. Late fusion performs cross-modal fusion after generating candidate boxes, as in MV3D [21] and CLOCs [22], but the degree of cross-modal interaction is limited. In addition, to improve efficiency and long-range performance, MV2DFusion adopts a sparse query–based fusion scheme, using object queries as carriers for cross-modal interaction [23]. To address sensor disparities, unified 3D representation methods such as FGU3R convert images into pseudo point clouds to enable fine-grained fusion [24]. Although multi-modal fusion methods can effectively mitigate inherent limitations of unimodal methods in depth estimation and robustness under adverse weather conditions [15,16], they face challenges in deployment cost and computational overhead introduced by multiple sensors.
2.3 Cross-Modal Knowledge Distillation for 3D Object Detection
Cross-modal knowledge distillation (CMKD) for 3D object detection aims to use a stronger, information-rich modality (e.g., LiDAR or multimodal fusion) during training to guide a weaker-modality detector (e.g., camera-only or radar-only). In this way, inference can rely solely on low-cost sensors, striking a balance between deployment efficiency and accuracy. Existing studies mainly focus on key issues such as modality representation gaps, spatial alignment, and noise in teacher-generated pseudo labels.
Early works such as MonoDistill [4] distill knowledge by projecting LiDAR features onto the image plane, improving spatial reasoning for monocular 3D detection. BEVDistill [6] and DistillBEV [25] further align image features with LiDAR teacher predictions in BEV space to enhance camera-based BEV detection. UniDistill [26] proposes a generic BEV-oriented CMKD framework that transfers knowledge at multiple levels, including features, predictions, and relations. To alleviate the high cost of 3D annotations, MonoLiG [27] and SCKD [28] combine CMKD with semi-supervised learning, using teacher-generated pseudo labels to train student models and suppressing noisy negative transfer via uncertainty weighting, feature distillation, and related techniques, thus moving CMKD from fully supervised to a semi-supervised training paradigm. In our method, the student is attracted toward dominant modes rather than individual noisy instances. This robustness mechanism is difficult to obtain from moment matching alone, which treats all samples implicitly through aggregated statistics, and it is also less explicit in adversarial alignment, where unstable optimization may itself introduce additional training noise [29]. The effectiveness of cross-modal distillation depends heavily on the teacher model’s representational capacity and the accuracy of cross-modal spatial alignment. Calibration errors or large modality discrepancies can easily lead to feature misalignment and negative transfer. To this end, we propose a distribution-level cross-modal distillation method to effectively address the above challenges.
3 Methodology
In this section, we propose an innovative distribution-aware cross-modal distillation framework that transfers geometric knowledge from a LiDAR-based teacher model to a multi-view camera student model, improving camera-only 3D object detection. Unlike mainstream point-to-point feature regression for BEV distillation, we model the teacher features with a probabilistic distribution and regularize the student features by enforcing distribution-level consistency. This method alleviates distillation instability caused by cross-modal feature distribution mismatches and noisy teacher signals.
As illustrated in Fig. 2, we first model the teacher’s BEV features within each foreground object region as a probabilistic distribution, and encourage the student features to fall into its high-density regions. This strategy couples the supervision strength with the statistical uncertainty of the teacher features, automatically reweighting different feature dimensions. We impose stronger supervision on more stable feature directions, while appropriately relaxing the constraints on directions that are more variable. In this way, the student progressively aligns with the teacher’s BEV feature distribution in an overall statistical sense, effectively narrowing the cross-modality feature gap in BEV space. Moreover, distribution-level distillation tends to pull the student toward the aggregated centers of dominant modes rather than individual noisy instances, thus mitigating the adverse impact of teacher noise without modifying the student architecture. Subsequently, we further introduce response distillation to refine output-level supervision and improve distillation quality. Notably, although distillation methods are effective at extracting and transferring knowledge, they cannot eliminate information loss at the physical level. To address this limitation, we incorporate temporal modeling to compensate for missing observations in the current frame by fusing information from historical frames.

Figure 2: A cross-modal knowledge distillation framework integrating LiDAR and camera modalities for enhanced BEV object detection.
3.2 Distribution-Aware Cross-Modal Distillation Framework
Previous BEV feature distillation methods [6,30] typically use a foreground mask to select target-relevant regions on the BEV plane and perform point-wise alignment between the student and teacher features at these locations. This concentrates the distillation on key spatial positions and reduces interference from background noise. The distillation loss typically takes the foreground-masked point-wise form

L_fg = (1 / (H · W)) Σ_{i=1}^{H} Σ_{j=1}^{W} M_{ij} ‖F^S_{ij} − F^T_{ij}‖²,

where H and W are the height and width of the BEV feature map, respectively, M_{ij} is the binary foreground mask, and F^S and F^T denote the student and teacher BEV features.
Although existing methods project both feature maps onto the BEV plane to alleviate cross-view discrepancies, a domain gap still remains due to differences in imaging mechanisms and spatial resolution. Moreover, teacher features often contain noise and bias. Directly forcing the student to mimic the teacher’s feature maps can weaken the distillation effectiveness. To address this, we employ a Dirichlet Process Gaussian Mixture Model (DPGMM) to model the distribution of the teacher’s BEV features, approximating it as a mixture of Gaussian components. DPGMM can adaptively infer the effective number of active components for each class from the data, thereby avoiding per-class manual tuning and providing a more flexible prior for distribution-level distillation. Each component is parameterized by a mean and a covariance matrix, which describe the feature center and its variation across directions. This shifts teacher supervision from point-wise distillation to distribution-level distillation. We then introduce a distribution-consistency constraint to encourage the student features to match the teacher’s mixture distribution in a probabilistic manner. Compared with purely point-wise regression, our method provides stronger and more structure-aware supervision. It avoids noise amplification and overly restrictive constraints caused by point-to-point alignment, leading to more robust BEV feature transfer.
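As a concrete sketch, a truncated Dirichlet-process mixture and the resulting distribution-consistency signal can be approximated with scikit-learn; the function names, hyperparameters, and positive-class gathering are illustrative assumptions, not the paper’s implementation:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_teacher_dpgmm(teacher_feats, max_components=8, seed=0):
    """Fit a truncated Dirichlet-process Gaussian mixture to one class's
    foreground teacher BEV features (shape (N, C)). max_components only
    bounds the number of active components; the DP prior prunes unneeded
    ones automatically, so no per-class tuning is required."""
    gmm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",  # per-dimension variances reweight feature directions
        max_iter=200,
        random_state=seed,
    )
    gmm.fit(teacher_feats)
    return gmm

def distribution_consistency_loss(gmm, student_feats):
    """Negative mean log-likelihood of student features under the teacher
    mixture: small when the student falls in high-density teacher regions."""
    return -gmm.score_samples(student_feats).mean()
```

With diagonal covariances, low-variance (stable) feature directions yield sharper densities and therefore stronger gradients, which mirrors the automatic reweighting behavior described above.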
Teacher model. The teacher model adopts CenterPoint, a LiDAR-based 3D object detector that performs detection in the BEV space. Given an input LiDAR point cloud, it first quantizes the 3D space into regular bins (voxels or pillars) and encodes points within each bin into learned features. A standard LiDAR-based backbone network (e.g., VoxelNet [31] or PointPillars [32]) then produces a BEV feature map
Student model. The student model is based on BEVDepth [9], a camera-only BEV detector that explicitly lifts multi-view image features into the BEV space using depth-aware projection. It first extracts image features with an image backbone and predicts per-pixel depth distributions using a depth network. The features are then lifted to 3D space and projected onto a predefined BEV grid through a lift-splat-shoot operation, followed by a 2D BEV backbone for further encoding, producing the student BEV feature map
Distribution-aware feature distillation (DAFD). For each ground-truth 3D bounding box
where K is the number of foreground objects.
Because the feature distribution is highly class-dependent, mixing different semantic categories would lead to ambiguous high-density regions that provide misleading supervision for distillation. Therefore, we model each class
where
where
For each class
with categorical factors:
where
and the normalized responsibilities are
where
Using the collapsed sufficient statistics aggregated over all samples, we obtain the posterior hyperparameters
In rare cases, some mixture components are supported by only a few teacher features, which leads to unreliable density estimates. Enforcing distribution-aware distillation on such poorly-supported components may introduce noisy supervision. Therefore, we apply a tiny-component filter. For each class
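The tiny-component filter can be sketched as follows; the support measure (expected sample count from the soft responsibilities) and the threshold value are illustrative assumptions:

```python
import numpy as np

def filter_tiny_components(weights, responsibilities, min_support=20):
    """Drop mixture components supported by too few teacher features.

    weights: (K,) mixture weights; responsibilities: (N, K) soft assignments
    of N teacher features to K components. A component's support is its
    expected sample count sum_n r[n, k]. min_support is a placeholder
    threshold, not the paper's exact value."""
    support = responsibilities.sum(axis=0)  # expected count per component
    keep = support >= min_support
    w = weights * keep                      # zero out poorly-supported components
    w = w / w.sum()                         # renormalize the surviving weights
    return w, keep
```

Filtering before distillation prevents density estimates built from a handful of samples from injecting noisy supervision into the student.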
Next, we extract student BEV features
In addition, we introduce a mixture-level regularization term to further align class-wise feature distributions across different modes:
To stabilize early training and ensure robustness, we include a standard pair-wise feature loss:
Thus, the final BEV feature distillation loss is defined as follows:
where

3.3 Response-Level Distillation
To transfer knowledge from the teacher’s detection head to the student’s detection head with the same architecture, we introduce a response-level loss, which directly encourages the student head’s outputs to match the teacher’s responses. We also apply ground-truth-guided head distillation to prevent background-dominated, uninformative locations from propagating noise.
For the classification branch, we distill the teacher’s soft responses in foreground regions and define the classification distillation term using a Gaussian focal loss, following [30]:
where
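A minimal sketch of such a Gaussian focal objective on teacher heatmaps is shown below; the alpha/beta values follow common CenterNet-style defaults and the near-peak positive threshold is an assumption, not a value confirmed by the paper:

```python
import numpy as np

def gaussian_focal_distill(student_hm, teacher_hm, alpha=2.0, beta=4.0, eps=1e-6):
    """Gaussian focal loss on the teacher's soft heatmap (sketch).

    student_hm, teacher_hm: (H, W) confidence maps in [0, 1]. Near-peak
    teacher responses act as soft positives; elsewhere the (1 - t)^beta
    factor downweights the negative loss close to object centers."""
    s = np.clip(student_hm, eps, 1 - eps)
    t = teacher_hm
    pos = t >= 0.99  # treat near-peak teacher responses as positives (assumed threshold)
    pos_loss = -((1 - s) ** alpha) * np.log(s) * pos
    neg_loss = -((1 - t) ** beta) * (s ** alpha) * np.log(1 - s) * (~pos)
    num_pos = max(pos.sum(), 1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```

A student heatmap that tracks the teacher’s peaks incurs a much smaller loss than a diffuse, uncertain one, which is the behavior the classification distillation term rewards.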
For the regression branch, following the training scheme of CenterPoint, we compute the regression distillation term using a
where
To summarize, we improve the camera-based student detector by distilling knowledge from a LiDAR teacher at two complementary levels. First, we perform distribution-aware feature distillation, which aligns the student’s BEV representations with the teacher via distribution-consistency constraints. Second, we apply response-level distillation on the detection head to further transfer the teacher’s prediction behavior, providing direct output-level guidance. These distillation objectives are jointly optimized with the student’s original training losses [9], including the standard 3D detection loss
where
3.4 Temporal Multi-View 3D Object Detection
While several existing methods achieve competitive 3D perception using a single image frame, relying solely on single-frame cues inevitably leads to performance bottlenecks. First, a single frame provides only static geometric and appearance information, which can result in unstable motion estimation. Second, objects that are occluded or only partially observed in one frame are more likely to be missed or localized inaccurately, hindering reliable detection. Incorporating temporal context improves the completeness and robustness of the representation. To this end, we introduce a lightweight, plug-and-play two-frame temporal fusion module that leverages distilled BEV features from the previous frame as historical compensation and injects cross-frame information into the current-frame representation through explicit alignment and adaptive fusion, thereby improving detection stability.
We take the current-frame BEV feature

Figure 3: Architecture of the two-frame temporal fusion module.
Specifically, we compute the relative transformation matrix
where
where
Unlike a cascaded two-stage warp (first obtaining
This design ensures that the entire alignment process performs only one interpolation, numerically avoiding the extra smoothing and amplification of systematic bias introduced by a second resampling.
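The single-interpolation alignment can be sketched with one composed affine resampling; the BEV grid units and axis conventions below are assumptions, not the paper’s exact coordinate handling:

```python
import numpy as np
from scipy.ndimage import affine_transform

def warp_prev_bev(prev_bev, rot, trans):
    """Warp the previous-frame BEV map into the current frame in one pass.

    prev_bev: (C, H, W) features; rot: (2, 2) ego rotation expressed in BEV
    grid coordinates; trans: (2,) translation in BEV cells. Composing the
    rotation and translation into a single affine map means every output
    cell is bilinearly interpolated exactly once, avoiding the extra
    smoothing a cascaded two-stage warp would introduce."""
    inv_rot = np.linalg.inv(rot)
    offset = -inv_rot @ trans  # pull-back: sample input at R^{-1}(o - t)
    return np.stack([
        affine_transform(ch, inv_rot, offset=offset, order=1, mode="constant")
        for ch in prev_bev
    ])
```

Because `affine_transform` evaluates `input[matrix @ o + offset]` per output cell `o`, the inverse map is applied once, matching the one-interpolation property discussed above.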
Next, we introduce a pixel-wise gating weight
where
Finally, we aggregate information from the previous frame in a residual manner:
where
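The residual, gated aggregation can be sketched as below; the small convolution that would produce the gate logits from the concatenated features is omitted and is an assumption about the gating network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_temporal_fusion(cur_bev, warped_prev_bev, gate_logits):
    """Residual injection of the aligned history frame (sketch).

    cur_bev, warped_prev_bev: (C, H, W); gate_logits: (1, H, W), broadcast
    over channels. Where the gate saturates toward 0 the model falls back
    to the current frame, so unreliable or inconsistent history is
    suppressed pixel by pixel."""
    g = sigmoid(gate_logits)             # pixel-wise weight in (0, 1)
    return cur_bev + g * warped_prev_bev  # residual aggregation
```

The residual form guarantees the current frame is never overwritten; history only adds information where the gate deems it trustworthy.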
4 Experiments
In this section, we present the evaluation setup, including the datasets used, evaluation metrics, and implementation details. We conduct a series of ablation studies and related analyses to thoroughly investigate the role and contribution of each component in our method. Finally, we perform comprehensive comparisons between our method and current state-of-the-art methods on widely used benchmark datasets.
4.1 Dataset and Evaluation Metrics
We evaluate our method on the nuScenes dataset, covering diverse scenarios and sensor configurations.
nuScenes Dataset contains 1000 scenes (700 train, 150 val, 150 test) captured with 6 cameras and a 32-beam LiDAR at 20 Hz/10 Hz. Annotations include 1.4M 3D bounding boxes for 10 classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, bicycle, barrier, traffic cone. We use the official metrics: nuScenes Detection Score (NDS), mean Average Precision (mAP), and 5 True Positive (TP) metrics: Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE), and Average Attribute Error (AAE). The NDS is calculated as follows:
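The official definition combines mAP with the five TP error scores as NDS = (1/10)[5 · mAP + Σ_{mTP ∈ TP} (1 − min(1, mTP))], which can be computed directly:

```python
def nuscenes_nds(map_score, tp_errors):
    """Official nuScenes Detection Score.

    tp_errors: the five mean TP errors [mATE, mASE, mAOE, mAVE, mAAE].
    Each error is clipped at 1 before being converted to a score, so mAP
    contributes half of the total and the TP scores the other half."""
    assert len(tp_errors) == 5
    return 0.1 * (5.0 * map_score + sum(1.0 - min(1.0, e) for e in tp_errors))
```

A perfect detector (mAP = 1, all TP errors 0) scores 1.0; errors at or above 1 contribute nothing.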
4.2 Implementation Details
Our framework is implemented using the MMDetection3D toolkit and trained on 4 NVIDIA GeForce RTX 4090 GPUs. We employ the AdamW optimizer with a cosine-scheduled learning rate of
In our DPGMM-based feature modeling, we set the tiny-component filter threshold to
4.3 Comparison with Other Models
We first report the main comparison results on the nuScenes validation set under the standard evaluation protocol. For a fair comparison, we group methods by backbone and input setting (image resolution and the number of frames), and summarize the overall 3D detection performance using the official metrics mAP and NDS. Table 1 compares our method with representative camera-based 3D detectors, and Table 2 further benchmarks different cross-modal distillation strategies under comparable student/teacher settings.
Table 1 compares our method with representative camera-based 3D object detection approaches on the nuScenes validation set, evaluated by mAP, NDS, and five TP error metrics. Overall, our approach achieves strong performance under two common settings: with ResNet50 at 256
Under the ResNet50 configuration, many competing methods operate with similar input resolution and typically 2 frames. Our method attains a higher NDS while simultaneously lowering geometry-related errors (mATE: 0.532, mASE: 0.223, mAOE: 0.398). These gains align with our design motivation: instead of enforcing strict point-wise matching, we perform distribution-aware cross-modal distillation that guides the student toward high-density and geometrically stable regions of the teacher feature space, which helps mitigate modality mismatch and suppress noisy teacher outliers. In addition, response-level distillation further transfers reliable decision behavior to the student, contributing to the overall quality improvements across TP metrics.
Some methods achieve strong performance by aggregating many historical frames (e.g., 16 + 1). In contrast, our model uses only 2 frames yet achieves competitive or better NDS and notably improved localization, demonstrating that lightweight temporal fusion with alignment and selective information injection can effectively compensate for occlusions and missing observations without relying on long sequences.
The ResNet101 results further confirm the scalability of our framework. With 2 frames, we achieve 0.467 mAP and 0.581 NDS, together with lower errors compared to the 2-frame baseline. These consistent gains support our conclusion that distribution-aware distillation provides robust geometric supervision under modality gaps, while the lightweight temporal design improves robustness in dynamic and occluded scenarios at low temporal overhead.
Table 2 compares our approach with representative cross-modal knowledge distillation and multi-modal baselines on the nuScenes validation set. Existing distillation methods typically reduce the camera–LiDAR gap via foreground feature imitation, label and response distillation, or multi-stage alignment. However, their gains can be affected by modality-induced distribution mismatch and noisy teacher signals (e.g., missed or false detections and feature jitter), which may limit robustness. In contrast, our method performs distribution-aware cross-modal distillation and combines it with a lightweight temporal design, aiming to transfer more stable geometric knowledge while keeping the student model efficient.
Under the ResNet50, 256
With a stronger backbone and higher resolution (ResNet101, 512

Figure 4: Visualization results of different cross-modal knowledge distillation methods.
To better understand where the improvements come from, we conduct controlled ablations on the nuScenes validation set by enabling one component at a time. We use BEVDepth as the camera-only baseline student, adopt CenterPoint as the teacher for distillation, and employ a lightweight 2-frame temporal modeling strategy. The results in Table 3 progressively quantify the contribution of feature distillation, response distillation, and temporal modeling.

Table 3 reports an ablation study on the nuScenes validation set to quantify the contribution of each component in our framework. In Setting 1, the baseline achieves 0.412 mAP and 0.535 NDS. In Setting 2, after introducing feature distillation, the performance improves to 0.443 mAP and 0.565 NDS, indicating that intermediate BEV representation guidance from the LiDAR teacher helps narrow the modality gap and provides more reliable geometric cues for the camera model, thereby improving spatial feature quality and overall 3D detection performance. Further enabling response distillation on top of feature distillation yields 0.451 mAP and 0.572 NDS, bringing consistent gains. This suggests that output-level supervision complements feature-level alignment by refining the student’s prediction distribution, which improves the final detection heads. Finally, incorporating 2-frame temporal modeling achieves the best performance of 0.467 mAP and 0.581 NDS. Temporal fusion aggregates complementary observations across consecutive frames, mitigating single-frame noise and partial occlusions and producing more stable BEV features and more consistent localization, which is reflected in both mAP and NDS. To illustrate the difference between the baseline model and the best-performing setting more intuitively, we visualize their inference results, as shown in Fig. 5.

Figure 5: Visualization results of baseline and our method.
To further investigate the contribution of the proposed distribution-aware feature distillation objective, we additionally perform a loss-level ablation study, as reported in Table 4.

Table 4 further analyzes the effect of each loss term in the proposed distribution-aware feature distillation. Starting from the pair-wise loss
We further analyze the design of the temporal modeling module by ablating its key components, including the learnable refinement and the motion-aware gating mechanism, as shown in Table 5.

Table 5 presents an ablation study of each component in the temporal modeling module. Here, the baseline denotes the basic two-frame fusion design with ego-motion compensation, while removing the learnable refinement offset and the motion-aware gating mechanism. Under this setting, the model achieves 0.460 mAP and 0.576 NDS, showing that simple temporal aggregation already provides useful historical context. After introducing the learnable refinement, the performance improves to 0.464 mAP and 0.579 NDS. This gain indicates that compensating for local misalignment beyond rigid ego-motion warping is beneficial, since discretization errors and dynamic scene variations cannot be fully handled by geometric transformation alone. When the motion-aware gating mechanism is further incorporated, the performance reaches 0.467 mAP and 0.581 NDS. This result shows that adaptively controlling the contribution of historical features is important for suppressing inconsistent or noisy temporal information, especially in regions affected by object motion or partial occlusion.
5 Conclusion
This paper presents a novel distribution-aware cross-modal distillation framework that transfers geometric priors from a LiDAR-based teacher to a camera-only student for temporal 3D object detection in the BEV space. To address distillation instability caused by modality heterogeneity and noisy teacher features, we propose distribution-aware BEV feature distillation that explicitly models class-conditional BEV feature distributions of the teacher using a DPGMM and constrains student features to match the teacher’s distribution in a probabilistic manner. Next, we introduce response-level distillation to transfer task-specific decision behavior at the detection head, improving output calibration and localization refinement. Furthermore, we design a lightweight two-frame temporal fusion module with ego-motion compensation, residual alignment refinement, and motion-aware gating to robustly aggregate complementary observations from consecutive frames. Although our study achieves promising results, the proposed framework may be less effective in adverse environments (e.g., low light, rain, or fog) where both camera and LiDAR signals degrade, making the teacher’s predictions unreliable and the student’s inputs severely corrupted. In such cases, distillation may propagate erroneous supervision and reduce overall performance. In future work, we will systematically investigate robustness under severe sensor degradation. We will also focus on uncertainty issues caused by long-tail categories and complex motion patterns, and explore more adaptive mixture distribution modeling and uncertainty characterization methods.
Acknowledgement: None.
Funding Statement: This paper is supported by the National Natural Science Foundation of China (Grant No. 62302086).
Author Contributions: The authors confirm contribution to the paper as follows: conceptualization, Tianzhe Jiao and Jie Song; methodology, Tianzhe Jiao and Yuming Chen; software, Xiaoyue Feng; validation, Yuming Chen, Tianzhe Jiao and Chaopeng Guo; formal analysis, Tianzhe Jiao; investigation, Yuming Chen; resources, Tianzhe Jiao; data curation, Xiaoyue Feng; writing—original draft preparation, Tianzhe Jiao; writing—review and editing, Jie Song; visualization, Yuming Chen; supervision, Jie Song; project administration, Chaopeng Guo; funding acquisition, Chaopeng Guo. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data that support the findings of this study are available from the Corresponding Author, upon reasonable request. The original data presented in the study are openly available in publicly accessible repositories: nuScenes at https://www.nuscenes.org/ and KITTI at http://www.cvlibs.net/datasets/kitti/eval_object.php.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.