TY  - EJOU
AU  - Jiao, Tianzhe
AU  - Chen, Yuming
AU  - Feng, Xiaoyue
AU  - Guo, Chaopeng
AU  - Song, Jie

TI  - DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection
T2  - Computer Modeling in Engineering & Sciences

PY  - 2026
VL  - 147
IS  - 1
SN  - 1526-1506

AB  - Knowledge distillation bridges the performance gap between camera-based and LiDAR-based 3D detectors by leveraging the precise geometric information from LiDAR. However, cross-modal knowledge transfer remains challenging due to the inherent modality heterogeneity between LiDAR and camera data, which often leads to instability during training. In this work, we find that these instabilities are closely related to distribution mismatch in the cross-modal feature space and noisy teacher signals. To address this issue, we propose a novel distribution-aware cross-modal distillation framework, named DA-T3D. Specifically, we first explicitly model the LiDAR teacher’s Bird’s-Eye-View (BEV) feature distribution and use the learned distribution as a statistical prior to guide the student features toward high-density and geometrically stable regions in the teacher’s BEV feature space. This ensures feature alignment in BEV space by constraining the student model’s feature distribution to match that of the LiDAR teacher model within foreground regions. Next, we further introduce response-level distillation to directly transfer the teacher’s prediction behavior to the student detection head, providing direct output-space supervision that complements feature distillation and effectively reduces modality-induced ambiguity, leading to more accurate and stable classification confidence and bounding-box regression. Furthermore, we perform temporal modeling on the distilled cross-modal features to produce fused BEV representations that capture more comprehensive scene context. Finally, we utilize the fused BEV features to generate 3D detection results. Through experiments, we validate the effectiveness and superiority of DA-T3D on the nuScenes dataset, achieving 46.7% mAP and 58.1% NDS.
KW  - 3D object detection
KW  - Bird’s-Eye-View perception
KW  - cross-modal knowledge distillation
KW  - Dirichlet process Gaussian mixture model
KW  - temporal modeling

DO  - 10.32604/cmes.2026.080595