Open Access
ARTICLE
DA-T3D: Distribution-Aware Cross-Modal Distillation Framework for Temporal 3D Object Detection
Software College, Northeastern University, Shenyang, China
* Corresponding Author: Jie Song. Email:
(This article belongs to the Special Issue: Advanced Image Segmentation and Object Detection: Innovations, Challenges, and Applications)
Computer Modeling in Engineering & Sciences 2026, 147(1), 1. https://doi.org/10.32604/cmes.2026.080595
Received 12 February 2026; Accepted 23 March 2026; Issue published 27 April 2026
Abstract
Knowledge distillation bridges the performance gap between camera-based and LiDAR-based 3D detectors by leveraging the precise geometric information from LiDAR. However, cross-modal knowledge transfer remains challenging due to the inherent modality heterogeneity between LiDAR and camera data, which often leads to instability during training. In this work, we find that these instabilities are closely related to distribution mismatch in the cross-modal feature space and to noisy teacher signals. To address these issues, we propose a novel distribution-aware cross-modal distillation framework, named DA-T3D. Specifically, we first explicitly model the LiDAR teacher’s Bird’s-Eye-View (BEV) feature distribution and use the learned distribution as a statistical prior that guides the student features toward high-density, geometrically stable regions of the teacher’s BEV feature space. This aligns features in BEV space by constraining the student’s feature distribution to match the teacher’s within foreground regions. We further introduce response-level distillation, which transfers the teacher’s prediction behavior directly to the student detection head; this output-space supervision complements feature distillation, reduces modality-induced ambiguity, and yields more accurate and stable classification confidence and bounding-box regression. In addition, we perform temporal modeling on the distilled cross-modal features to produce fused BEV representations that capture more comprehensive scene context, and we use the fused BEV features to generate the final 3D detection results. Experiments on the nuScenes dataset validate the effectiveness and superiority of DA-T3D, which achieves 46.7% mAP and 58.1% NDS.
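To make the feature-level objective concrete, the sketch below illustrates one plausible reading of the distribution-aware distillation step described above, assuming a PyTorch setting and a per-channel Gaussian prior over the teacher's foreground BEV features. The tensor names, shapes, and the Gaussian form of the prior are our illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of distribution-aware feature distillation, assuming PyTorch.
# The Gaussian prior, tensor shapes, and all names below are illustrative
# assumptions; they are not the paper's released code.
import torch


def distribution_aware_distill_loss(
    teacher_bev: torch.Tensor,   # (B, C, H, W) LiDAR teacher BEV features
    student_bev: torch.Tensor,   # (B, C, H, W) camera student BEV features
    fg_mask: torch.Tensor,       # (B, 1, H, W) foreground mask in {0, 1}
    eps: float = 1e-6,
) -> torch.Tensor:
    """Pull student features toward high-density regions of the teacher's
    foreground feature distribution, modeled here as a per-channel Gaussian."""
    # Flatten spatial dimensions and keep only foreground cells.
    b, c, h, w = teacher_bev.shape
    t = teacher_bev.permute(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C)
    s = student_bev.permute(0, 2, 3, 1).reshape(-1, c)
    m = fg_mask.reshape(-1).bool()
    t_fg, s_fg = t[m], s[m]

    # Statistical prior: per-channel mean/variance of teacher foreground features.
    mu = t_fg.mean(dim=0, keepdim=True)                           # (1, C)
    var = t_fg.var(dim=0, unbiased=False, keepdim=True) + eps     # (1, C)

    # Gaussian negative log-likelihood of student features under the teacher
    # prior (up to an additive constant): the loss is large when the student
    # lies in low-density regions of the teacher's BEV feature space.
    nll = 0.5 * ((s_fg - mu) ** 2 / var + var.log())
    return nll.mean()
```

Restricting the statistics to foreground cells mirrors the abstract's statement that alignment is enforced within foreground regions, and the variance term in the likelihood keeps low-variance channels, which the teacher treats as geometrically stable, weighted more strongly than noisy ones.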
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

