Open Access
ARTICLE
CAFE-GAN: CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination
1 School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an, 710121, China
2 Test Center, National University of Defense Technology, Xi’an, 710106, China
* Corresponding Author: Yijun Zhang. Email:
Computers, Materials & Continua 2026, 86(1), 1-19. https://doi.org/10.32604/cmc.2025.069482
Received 24 June 2025; Accepted 04 September 2025; Issue published 10 November 2025
Abstract
Over the past decade, large-scale pre-trained autoregressive and diffusion models have rejuvenated the field of text-guided image generation. However, these models require enormous datasets and parameter counts, and their multi-step generation processes are often inefficient and difficult to control. To address these challenges, we propose CAFE-GAN, a CLIP-Projected GAN with Attention-Aware Generation and Multi-Scale Discrimination, which incorporates a pre-trained CLIP model along with several key architectural innovations. First, we embed a coordinate attention mechanism into the generator to capture long-range dependencies and enhance feature representation. Second, we introduce a trainable linear projection layer after the CLIP text encoder, which aligns textual embeddings with the generator's semantic space. Third, we design a multi-scale discriminator that leverages pre-trained visual features and integrates a feature regularization strategy, thereby improving training stability and discrimination performance. Experiments on the CUB and COCO datasets demonstrate that CAFE-GAN outperforms existing text-to-image generation methods, achieving Fréchet Inception Distance (FID) scores of 9.84 on CUB and 5.62 on COCO and generating images with superior visual quality and semantic fidelity, surpassing current state-of-the-art text-to-image models by varying degrees. These findings offer valuable insights for future research on efficient, controllable text-to-image synthesis.
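To make the two generator-side components named in the abstract concrete, the following is a minimal PyTorch sketch of a coordinate attention block and a trainable linear projection over frozen CLIP text embeddings. It is an illustration under stated assumptions (module names, channel sizes, and the choice of activation are ours), not the authors' released implementation.

```python
# Illustrative sketch only: coordinate attention in the spirit of Hou et al.
# (2021) plus a trainable linear projection over CLIP text features, as the
# abstract describes. Names and hyperparameters are assumptions, not the
# paper's code.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Gates features with attention factorized along height and width."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Pool along each spatial axis, then share one 1x1 conv over both.
        x_h = self.pool_h(x)                      # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)          # (B, C, H+W, 1)
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Direction-specific sigmoid gates, broadcast over the feature map.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w

class TextProjection(nn.Module):
    """Trainable linear map from frozen CLIP text embeddings to the
    generator's semantic space (dimensions here are illustrative)."""
    def __init__(self, clip_dim: int = 512, gen_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(clip_dim, gen_dim)

    def forward(self, clip_text_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(clip_text_emb)
```

In this reading, the CLIP text encoder stays frozen and only the projection (and generator) receive gradients, which is the usual way a lightweight adapter aligns a pre-trained embedding space with a GAN's conditioning space.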
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

