TY  - EJOU
AU  - Yu, Xian
AU  - Zhang, Jianxun
AU  - Tian, Siran
AU  - He, Xiaobao

TI  - Optimizing Semantic and Texture Consistency in Video Generation
T2  - Computers, Materials & Continua

PY  - 2025
VL  - 85
IS  - 1
SN  - 1546-2226

AB  - In recent years, diffusion models have achieved remarkable progress in image generation. However, extending them to text-to-video (T2V) generation remains challenging, particularly in maintaining semantic consistency and visual quality across frames. Existing approaches often overlook the synergy between high-level semantics and low-level texture information, resulting in blurry or temporally inconsistent outputs. To address these issues, we propose Dual Consistency Training (DCT), a novel framework designed to jointly optimize semantic and texture consistency in video generation. Specifically, we introduce a multi-scale spatial adapter to enhance spatial feature extraction, and leverage the complementary strengths of CLIP and VGG—where CLIP focuses on high-level semantics and VGG captures fine-grained texture and detail. During training, a stepwise strategy is adopted to impose semantic and texture losses, constraining discrepancies between generated and ground-truth frames. Furthermore, we propose CLWS, which dynamically adjusts the balance between semantic and texture losses to facilitate more stable and effective optimization. Remarkably, DCT achieves high-quality video generation using only a single training video on a single NVIDIA A6000 GPU. Extensive experiments demonstrate that our method significantly improves temporal coherence and visual fidelity across various video generation tasks, verifying its effectiveness and generalizability.
KW  - Diffusion model; dynamic weighting; text-to-video; one-shot

DO  - 10.32604/cmc.2025.065529