DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection
Tianzhe Jiao, Yuming Chen, Xiaoyue Feng, Chaopeng Guo, Jie Song*
Software College, Northeastern University, Shenyang, China
* Corresponding Author: Jie Song. Email: songjie@mail.neu.edu.cn
Computer Modeling in Engineering & Sciences https://doi.org/10.32604/cmes.2026.080595
Received 12 February 2026; Accepted 23 March 2026; Published online 08 April 2026
Abstract
Knowledge distillation bridges the performance gap between camera-based and LiDAR-based 3D
detectors by leveraging the precise geometric information from LiDAR. However, cross-modal knowledge transfer
remains challenging due to the inherent modality heterogeneity between LiDAR and camera data, which often leads to
instability during training. In this work, we find that these instabilities are closely related to distribution mismatch in
the cross-modal feature space and noisy teacher signals. To address these issues, we propose a novel distribution-aware
cross-modal distillation framework, named DA-T3D. Specifically, we first explicitly model the LiDAR teacher’s Bird’s-
Eye-View (BEV) feature distribution and use the learned distribution as a statistical prior to guide the student features
toward high-density and geometrically stable regions in the teacher’s BEV feature space. This ensures feature alignment
in BEV space by constraining the student model’s feature distribution to match that of the LiDAR teacher model within
foreground regions. Next, we introduce response-level distillation, which transfers the teacher's prediction
behavior directly to the student detection head. This output-space supervision complements feature distillation,
reduces modality-induced ambiguity, and yields more accurate and stable classification confidence and
bounding-box regression. Furthermore, we perform temporal modeling on the distilled cross-modal features to
produce fused BEV representations that capture more comprehensive scene context. Finally, we utilize the fused BEV
features to generate 3D detection results. Experiments on the nuScenes dataset validate the effectiveness of
DA-T3D, which achieves 46.7% mAP and 58.1% NDS.
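The core idea of the distribution-level distillation can be illustrated with a minimal sketch: fit a statistical prior on the teacher's foreground BEV features, then penalize student features by their negative log-likelihood under that prior. For brevity a single Gaussian stands in for the paper's Dirichlet process Gaussian mixture model; all function and variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_teacher_prior(teacher_feats):
    """Fit a Gaussian prior on teacher foreground BEV features of shape (N, C)."""
    mu = teacher_feats.mean(axis=0)
    # Small jitter keeps the covariance invertible for low-rank feature sets.
    cov = np.cov(teacher_feats, rowvar=False) + 1e-6 * np.eye(teacher_feats.shape[1])
    return mu, cov

def distribution_distill_loss(student_feats, mu, cov):
    """Mean negative log-likelihood of student features under the teacher prior.

    Low loss means the student features lie in high-density regions of the
    teacher's feature distribution, which is the alignment goal.
    """
    inv = np.linalg.inv(cov)
    diff = student_feats - mu
    mahalanobis = np.einsum('nc,cd,nd->n', diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    d = mu.shape[0]
    nll = 0.5 * (mahalanobis + logdet + d * np.log(2.0 * np.pi))
    return nll.mean()

rng = np.random.default_rng(0)
teacher = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in teacher BEV features
mu, cov = fit_teacher_prior(teacher)

aligned = rng.normal(0.0, 1.0, size=(100, 8))   # student features matching the prior
shifted = rng.normal(3.0, 1.0, size=(100, 8))   # student features with a modality gap

# Aligned features incur a much lower distillation loss than shifted ones.
print(distribution_distill_loss(aligned, mu, cov)
      < distribution_distill_loss(shifted, mu, cov))  # prints True
```

In the full framework this loss would be computed only over foreground BEV cells and combined with the response-level and temporal components described above; a mixture model (e.g., a DP-GMM) would replace the single Gaussian to capture multi-modal feature densities.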
Keywords
3D object detection; Bird’s-Eye-View perception; cross-modal knowledge distillation; Dirichlet process Gaussian mixture model; temporal modeling