Open Access

ARTICLE


Optimizing Semantic and Texture Consistency in Video Generation

Xian Yu, Jianxun Zhang*, Siran Tian, Xiaobao He

College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, 400054, China

* Corresponding Author: Jianxun Zhang

Computers, Materials & Continua 2025, 85(1), 1883-1897. https://doi.org/10.32604/cmc.2025.065529

Abstract

In recent years, diffusion models have achieved remarkable progress in image generation. However, extending them to text-to-video (T2V) generation remains challenging, particularly in maintaining semantic consistency and visual quality across frames. Existing approaches often overlook the synergy between high-level semantics and low-level texture information, resulting in blurry or temporally inconsistent outputs. To address these issues, we propose Dual Consistency Training (DCT), a novel framework that jointly optimizes semantic and texture consistency in video generation. Specifically, we introduce a multi-scale spatial adapter to enhance spatial feature extraction and leverage the complementary strengths of CLIP and VGG: CLIP captures high-level semantics, while VGG captures fine-grained texture and detail. During training, a stepwise strategy imposes semantic and texture losses that constrain discrepancies between generated and ground-truth frames. Furthermore, we propose CLWS, a dynamic weighting scheme that adjusts the balance between the semantic and texture losses to enable more stable and effective optimization. Remarkably, DCT achieves high-quality video generation using only a single training video on a single NVIDIA A6000 GPU. Extensive experiments demonstrate that our method significantly improves temporal coherence and visual fidelity across various video generation tasks, verifying its effectiveness and generalizability.
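The abstract describes the two loss terms and their dynamic balancing only at a high level, so the following is a minimal PyTorch sketch of how such a dual loss could look, not the paper's actual implementation. It assumes CLIP image embeddings for the semantic term, truncated VGG-19 features for the texture term, and a simple linear schedule standing in for CLWS, whose exact formulation is not given on this page; the names dual_consistency_loss and clws_weight are illustrative.

    # Hypothetical sketch of the dual (semantic + texture) consistency loss
    # described in the abstract. The DCT/CLWS details are not given on this
    # page; the weighting schedule below is an illustrative stand-in.
    import torch
    import torch.nn.functional as F
    from torchvision.models import vgg19, VGG19_Weights
    from transformers import CLIPVisionModelWithProjection

    # Frozen feature extractors: CLIP for high-level semantics,
    # truncated VGG-19 for low-level texture and detail.
    clip = CLIPVisionModelWithProjection.from_pretrained(
        "openai/clip-vit-base-patch32").eval()
    vgg_feats = vgg19(weights=VGG19_Weights.DEFAULT).features[:16].eval()
    for p in list(clip.parameters()) + list(vgg_feats.parameters()):
        p.requires_grad_(False)

    def clws_weight(step: int, total_steps: int) -> float:
        """Illustrative dynamic weight: shifts emphasis from semantics
        toward texture as training progresses (stand-in for CLWS)."""
        return step / max(total_steps, 1)

    def dual_consistency_loss(gen_frames, gt_frames, step, total_steps):
        """gen_frames, gt_frames: (B*T, 3, H, W) tensors in [0, 1].
        NOTE: per-model input normalization (CLIP / ImageNet mean-std)
        is omitted for brevity."""
        # Semantic term: cosine distance between per-frame CLIP image
        # embeddings; CLIP ViT-B/32 expects 224x224 inputs.
        gen224 = F.interpolate(gen_frames, size=224, mode="bilinear",
                               align_corners=False)
        gt224 = F.interpolate(gt_frames, size=224, mode="bilinear",
                              align_corners=False)
        e_gen = clip(pixel_values=gen224).image_embeds
        e_gt = clip(pixel_values=gt224).image_embeds
        sem = (1 - F.cosine_similarity(e_gen, e_gt, dim=-1)).mean()

        # Texture term: L1 distance between mid-level VGG feature maps.
        tex = F.l1_loss(vgg_feats(gen_frames), vgg_feats(gt_frames))

        # Dynamic balance between the two terms.
        w = clws_weight(step, total_steps)
        return (1 - w) * sem + w * tex

The abstract's stepwise strategy suggests the two losses are imposed in stages during training; the continuous schedule above is only a stand-in for that behavior.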

Keywords

Diffusion model; dynamic weighting; text-to-video; one-shot

Cite This Article

APA Style
Yu, X., Zhang, J., Tian, S., & He, X. (2025). Optimizing Semantic and Texture Consistency in Video Generation. Computers, Materials & Continua, 85(1), 1883–1897. https://doi.org/10.32604/cmc.2025.065529
Vancouver Style
Yu X, Zhang J, Tian S, He X. Optimizing Semantic and Texture Consistency in Video Generation. Comput Mater Contin. 2025;85(1):1883–1897. https://doi.org/10.32604/cmc.2025.065529
IEEE Style
X. Yu, J. Zhang, S. Tian, and X. He, “Optimizing Semantic and Texture Consistency in Video Generation,” Comput. Mater. Contin., vol. 85, no. 1, pp. 1883–1897, 2025. https://doi.org/10.32604/cmc.2025.065529



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.