Xian Yu, Jianxun Zhang*, Siran Tian, Xiaobao He
CMC-Computers, Materials & Continua, Vol. 85, No. 1, pp. 1883-1897, 2025, DOI: 10.32604/cmc.2025.065529
29 August 2025
Abstract: In recent years, diffusion models have achieved remarkable progress in image generation. However, extending them to text-to-video (T2V) generation remains challenging, particularly in maintaining semantic consistency and visual quality across frames. Existing approaches often overlook the synergy between high-level semantics and low-level texture information, resulting in blurry or temporally inconsistent outputs. To address these issues, we propose Dual Consistency Training (DCT), a novel framework designed to jointly optimize semantic and texture consistency in video generation. Specifically, we introduce a multi-scale spatial adapter to enhance spatial feature extraction, and leverage the complementary strengths of CLIP and …
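The abstract mentions a multi-scale spatial adapter for enhancing spatial feature extraction but is truncated before any details. Below is a minimal, hypothetical sketch of what such an adapter could look like; the class name, scale choices, and residual fusion design are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a multi-scale spatial adapter (assumed design,
# not the DCT paper's code): per-scale conv branches fused with a
# residual connection so pretrained diffusion features are preserved.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSpatialAdapter(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One lightweight 3x3 conv branch per spatial scale.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in scales
        )
        # 1x1 conv to fuse the concatenated branch outputs.
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = []
        for scale, conv in zip(self.scales, self.branches):
            # Downsample, convolve, then upsample back to the input size.
            y = F.avg_pool2d(x, kernel_size=scale) if scale > 1 else x
            y = conv(y)
            if scale > 1:
                y = F.interpolate(y, size=(h, w), mode="bilinear",
                                  align_corners=False)
            outs.append(y)
        # Residual connection keeps the backbone's features intact.
        return x + self.fuse(torch.cat(outs, dim=1))

# Example: adapt a batch of per-frame feature maps.
feats = torch.randn(2, 64, 32, 32)
adapter = MultiScaleSpatialAdapter(channels=64)
print(adapter(feats).shape)  # torch.Size([2, 64, 32, 32])
```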