Open Access

ARTICLE


CAFE-GAN: CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination

Xuanhong Wang1, Hongyu Guo1, Jiazhen Li1, Mingchen Wang1, Xian Wang1, Yijun Zhang2,*

1 School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an, 710121, China
2 Test Center, National University of Defense Technology, Xi’an, 710106, China

* Corresponding Author: Yijun Zhang. Email: email

Computers, Materials & Continua 2026, 86(1), 1-19. https://doi.org/10.32604/cmc.2025.069482

Abstract

Over the past decade, large-scale pre-trained autoregressive and diffusion models have rejuvenated the field of text-guided image generation. However, these models require enormous datasets and parameter counts, and their multi-step generation processes are often inefficient and difficult to control. To address these challenges, we propose CAFE-GAN, a CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination, which incorporates a pre-trained CLIP model along with several key architectural innovations. First, we embed a coordinate attention mechanism into the generator to capture long-range dependencies and enhance feature representation. Second, we introduce a trainable linear projection layer after the CLIP text encoder, which aligns textual embeddings with the generator’s semantic space. Third, we design a multi-scale discriminator that leverages pre-trained visual features and integrates a feature regularization strategy, thereby improving training stability and discrimination performance. Experiments on the CUB and COCO datasets demonstrate that CAFE-GAN outperforms existing text-to-image generation methods, generating images with superior visual quality and semantic fidelity and achieving Fréchet Inception Distance (FID) scores of 9.84 and 5.62 on CUB and COCO, respectively, surpassing current state-of-the-art text-to-image models by varying degrees. These findings offer valuable insights for future research on efficient, controllable text-to-image synthesis.
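The trainable linear projection described above can be illustrated with a minimal sketch. This is not the authors' implementation; the dimensions are assumptions for illustration (CLIP ViT-B/32 text embeddings are 512-dimensional, and a generator conditioning dimension of 256 is hypothetical), and the projection is shown with plain NumPy rather than a deep-learning framework.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-B/32 emits 512-d text embeddings;
# the generator's conditioning space is taken to be 256-d here.
CLIP_DIM, COND_DIM = 512, 256

rng = np.random.default_rng(0)

# Trainable parameters of the projection placed after the (frozen)
# CLIP text encoder; in training these would be updated by backprop.
W = rng.standard_normal((CLIP_DIM, COND_DIM)) * 0.02
b = np.zeros(COND_DIM)

def project_text_embedding(clip_emb: np.ndarray) -> np.ndarray:
    """Map CLIP text embeddings into the generator's semantic space."""
    return clip_emb @ W + b

# Usage: a batch of 4 sentence embeddings from the CLIP text encoder.
text_emb = rng.standard_normal((4, CLIP_DIM))
cond = project_text_embedding(text_emb)
print(cond.shape)  # (4, 256)
```

Because the projection is a single learned affine map, it adapts the frozen encoder's embedding geometry to the generator without fine-tuning CLIP itself.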

Keywords

Large vision language models; deep learning; computer vision; text-to-image generation

Cite This Article

APA Style
Wang, X., Guo, H., Li, J., Wang, M., Wang, X. et al. (2026). CAFE-GAN: CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination. Computers, Materials & Continua, 86(1), 1–19. https://doi.org/10.32604/cmc.2025.069482
Vancouver Style
Wang X, Guo H, Li J, Wang M, Wang X, Zhang Y. CAFE-GAN: CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination. Comput Mater Contin. 2026;86(1):1–19. https://doi.org/10.32604/cmc.2025.069482
IEEE Style
X. Wang, H. Guo, J. Li, M. Wang, X. Wang, and Y. Zhang, “CAFE-GAN: CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination,” Comput. Mater. Contin., vol. 86, no. 1, pp. 1–19, 2026. https://doi.org/10.32604/cmc.2025.069482



Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.