Open Access

ARTICLE

LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning

Yuting He, Zetao Jiang*

Guangxi Key Lab of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin, 541004, China

* Corresponding Author: Zetao Jiang

Computers, Materials & Continua 2025, 84(3), 5487-5508. https://doi.org/10.32604/cmc.2025.065169

Abstract

Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies, which effectively extracts and leverages the global correlation of image features. However, these models still struggle to capture local associations. Moreover, because the global and local association features extracted by the encoder focus on different semantic information, semantic noise may arise during decoding. To address these issues, we propose the Local Relationship Enhanced Gated Transformer (LREGT). In the encoder, we introduce the Local Relationship Enhanced Encoder (LREE), whose core component is the Local Relationship Enhanced Module (LREM). LREM consists of two novel designs, the Local Correlation Perception Module (LCPM) and the Local-Global Fusion Module (LGFM), which together help generate a comprehensive feature representation that integrates both global and local information. In the decoder, we propose the Dual-level Multi-branch Gated Decoder (DMGD). It first creates multiple decoding branches to generate multi-perspective contextual feature representations. It then employs the Dual-Level Gating Mechanism (DLGM) to model the multi-level relationships among these multi-perspective contextual features, enhancing their fine-grained semantics and intrinsic relationship representations, which ultimately yields high-quality, semantically rich image captions. Experiments on the standard MSCOCO dataset demonstrate that LREGT achieves state-of-the-art performance, with a CIDEr score of 140.8 and a BLEU-4 score of 41.3, significantly outperforming existing mainstream methods. These results highlight LREGT’s superiority in capturing complex visual relationships and resolving semantic noise during decoding.

Keywords

Image captioning; local relation enhancement; local correlation perception; dual-level gating mechanism

Cite This Article

APA Style
He, Y., & Jiang, Z. (2025). LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning. Computers, Materials & Continua, 84(3), 5487–5508. https://doi.org/10.32604/cmc.2025.065169
Vancouver Style
He Y, Jiang Z. LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning. Comput Mater Contin. 2025;84(3):5487–5508. https://doi.org/10.32604/cmc.2025.065169
IEEE Style
Y. He and Z. Jiang, “LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning,” Comput. Mater. Contin., vol. 84, no. 3, pp. 5487–5508, 2025. https://doi.org/10.32604/cmc.2025.065169



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.