Optimizing Semantic and Texture Consistency in Video Generation

Xian Yu; Jianxun Zhang; Siran Tian; Xiaobao He

doi:10.32604/cmc.2025.065529

Open Access

ARTICLE

Xian Yu, Jianxun Zhang*, Siran Tian, Xiaobao He

College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, 400054, China

* Corresponding Author: Jianxun Zhang.

Computers, Materials & Continua 2025, 85(1), 1883-1897. https://doi.org/10.32604/cmc.2025.065529

Received 15 March 2025; Accepted 17 July 2025; Issue published 29 August 2025

Abstract

In recent years, diffusion models have achieved remarkable progress in image generation. However, extending them to text-to-video (T2V) generation remains challenging, particularly in maintaining semantic consistency and visual quality across frames. Existing approaches often overlook the synergy between high-level semantics and low-level texture information, resulting in blurry or temporally inconsistent outputs. To address these issues, we propose Dual Consistency Training (DCT), a framework that jointly optimizes semantic and texture consistency in video generation. Specifically, we introduce a multi-scale spatial adapter to enhance spatial feature extraction and leverage the complementary strengths of CLIP and VGG: CLIP focuses on high-level semantics, while VGG captures fine-grained texture and detail. During training, semantic and texture losses are imposed in a stepwise manner to constrain discrepancies between generated and ground-truth frames. Furthermore, we propose CLWS, a dynamic weighting scheme that adjusts the balance between semantic and texture losses to facilitate more stable and effective optimization. Remarkably, DCT achieves high-quality video generation using only a single training video on a single NVIDIA A6000 GPU. Extensive experiments demonstrate that our method significantly improves temporal coherence and visual fidelity across various video generation tasks, verifying its effectiveness and generalizability.
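
To make the idea of balancing a semantic loss against a texture loss concrete, the following is a minimal sketch only. The abstract does not specify how CLWS computes its weights, so the linear ramp, the function names `loss_weights` and `combined_loss`, and the `tex_max` parameter are all assumptions introduced for illustration and are not the authors' method. The two loss values are treated as plain numbers; in practice they might come from CLIP and VGG feature distances.

```python
from typing import Tuple


def loss_weights(step: int, total_steps: int, tex_max: float = 0.5) -> Tuple[float, float]:
    """Return (w_sem, w_tex) for the given training step.

    Hypothetical schedule (an assumption, not the paper's CLWS rule):
    the texture weight ramps linearly from 0 to tex_max over training,
    and the semantic weight is the remainder so the two sum to 1.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_tex = tex_max * progress
    w_sem = 1.0 - w_tex
    return w_sem, w_tex


def combined_loss(sem_loss: float, tex_loss: float, step: int, total_steps: int) -> float:
    """Weighted sum of a semantic loss and a texture loss at a given step."""
    w_sem, w_tex = loss_weights(step, total_steps)
    return w_sem * sem_loss + w_tex * tex_loss
```

Under this assumed schedule, training starts with the semantic term alone and gradually gives the texture term up to half of the total weight; any other monotone schedule or a loss-magnitude-based rule could be substituted in `loss_weights` without changing `combined_loss`.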

Keywords

Diffusion model; dynamic weighting; text-to-video; one-shot

Cite This Article

APA Style

Yu, X., Zhang, J., Tian, S., He, X. (2025). Optimizing Semantic and Texture Consistency in Video Generation. Computers, Materials & Continua, 85(1), 1883–1897. https://doi.org/10.32604/cmc.2025.065529

Vancouver Style

Yu X, Zhang J, Tian S, He X. Optimizing Semantic and Texture Consistency in Video Generation. Comput Mater Contin. 2025;85(1):1883–1897. https://doi.org/10.32604/cmc.2025.065529

IEEE Style

X. Yu, J. Zhang, S. Tian, and X. He, “Optimizing Semantic and Texture Consistency in Video Generation,” Comput. Mater. Contin., vol. 85, no. 1, pp. 1883–1897, 2025. https://doi.org/10.32604/cmc.2025.065529

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.