DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection
Tianzhe Jiao, Yuming Chen, Xiaoyue Feng, Chaopeng Guo, Jie Song*
Software College, Northeastern University, Shenyang, China
* Corresponding Author: Jie Song. Email: songjie@mail.neu.edu.cn
Computer Modeling in Engineering & Sciences https://doi.org/10.32604/cmes.2026.080595
Received 12 February 2026; Accepted 23 March 2026; Published online 08 April 2026
Abstract
Knowledge distillation bridges the performance gap between camera-based and LiDAR-based 3D
detectors by leveraging the precise geometric information from LiDAR. However, cross-modal knowledge transfer
remains challenging due to the inherent modality heterogeneity between LiDAR and camera data, which often leads to
instability during training. In this work, we find that these instabilities are closely related to distribution mismatch in
the cross-modal feature space and noisy teacher signals. To address these issues, we propose a novel distribution-aware
cross-modal distillation framework, named DA-T3D. Specifically, we first explicitly model the LiDAR teacher’s Bird’s-
Eye-View (BEV) feature distribution and use the learned distribution as a statistical prior to guide the student features
toward high-density and geometrically stable regions in the teacher’s BEV feature space. This ensures feature alignment
in BEV space by constraining the student model’s feature distribution to match that of the LiDAR teacher model within
foreground regions. Next, we introduce response-level distillation, which transfers the teacher's prediction
behavior directly to the student detection head. This output-space supervision complements feature distillation,
reduces modality-induced ambiguity, and yields more accurate and stable classification confidence and
bounding-box regression. Furthermore, we perform temporal modeling on the distilled cross-modal features to
produce fused BEV representations that capture more comprehensive scene context. Finally, we utilize the fused BEV
features to generate 3D detection results. Experiments on the nuScenes dataset validate the effectiveness of
DA-T3D, which achieves 46.7% mAP and 58.1% NDS.
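The core idea of the distribution-level distillation can be illustrated with a minimal sketch: fit a statistical prior on the teacher's foreground BEV features, then penalize student features by their negative log-likelihood under that prior. For brevity a single Gaussian stands in for the paper's Dirichlet process Gaussian mixture model; all function and variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_teacher_prior(teacher_feats):
    """Fit a Gaussian prior on teacher foreground BEV features of shape (N, C)."""
    mu = teacher_feats.mean(axis=0)
    # Small jitter keeps the covariance invertible for low-rank feature sets.
    cov = np.cov(teacher_feats, rowvar=False) + 1e-6 * np.eye(teacher_feats.shape[1])
    return mu, cov

def distribution_distill_loss(student_feats, mu, cov):
    """Mean negative log-likelihood of student features under the teacher prior.

    Low loss means the student features lie in high-density regions of the
    teacher's feature distribution, which is the alignment goal.
    """
    inv = np.linalg.inv(cov)
    diff = student_feats - mu
    mahalanobis = np.einsum('nc,cd,nd->n', diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    d = mu.shape[0]
    nll = 0.5 * (mahalanobis + logdet + d * np.log(2.0 * np.pi))
    return nll.mean()

rng = np.random.default_rng(0)
teacher = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in teacher BEV features
mu, cov = fit_teacher_prior(teacher)

aligned = rng.normal(0.0, 1.0, size=(100, 8))   # student features matching the prior
shifted = rng.normal(3.0, 1.0, size=(100, 8))   # student features with a modality gap

# Aligned features incur a much lower distillation loss than shifted ones.
print(distribution_distill_loss(aligned, mu, cov)
      < distribution_distill_loss(shifted, mu, cov))  # prints True
```

In the full framework this loss would be computed only over foreground BEV cells and combined with the response-level and temporal components described above; a mixture model (e.g., a DP-GMM) would replace the single Gaussian to capture multi-modal feature densities.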
Keywords
3D object detection; Bird’s-Eye-View perception; cross-modal knowledge distillation; Dirichlet process Gaussian mixture model; temporal modeling