PCATNet: Position-Class Awareness Transformer for Image Captioning

Ziwei Tang; Yaohua Yi; Changhui Yu; Aiguo Yin

doi:10.32604/cmc.2023.037861

Open Access

ARTICLE

Ziwei Tang¹, Yaohua Yi²,*, Changhui Yu², Aiguo Yin³

1 Research Center of Graphic Communication, Printing and Packaging, Wuhan University, Wuhan, 430072, China
2 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, 430072, China
3 Zhuhai Pantum Electronics Co., Ltd., Zhuhai, 519060, China

* Corresponding Author: Yaohua Yi. Email: email

Computers, Materials & Continua 2023, 75(3), 6007-6022. https://doi.org/10.32604/cmc.2023.037861

Received 18 November 2022; Accepted 07 March 2023; Issue published 29 April 2023

Abstract

Existing image captioning models usually generate captions by relating visual information to words, but they make little use of spatial information or object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network that serves as a bridge between visual features and captions by embedding spatial information and awareness of object classes. We construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE maps the regions of objects to grids, calculates the relative distances among objects, and then applies quantization. We also modify the self-attention mechanism to accommodate GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is used to facilitate feature embedding and to refine the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and the decoder, yielding OCA_E and OCA_D, respectively. Finally, we apply GMPE, OCA_E, and OCA_D in various combinations to complete the entire PCAT. We evaluate our method on the MSCOCO dataset. The results demonstrate that PCAT outperforms other competitive methods.
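The three GMPE steps named in the abstract (mapping object regions to grids, computing relative distances, and quantization) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: boxes are mapped by their center point, the grid is square with a hypothetical `grid_size`, and each pair of objects receives a single integer index built from its row and column offsets. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels, inside the image.
Box = Tuple[float, float, float, float]


def box_to_grid_cell(box: Box, image_w: float, image_h: float,
                     grid_size: int) -> Tuple[int, int]:
    """Map a box to the (row, col) of the grid cell containing its center."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Clamp so that centers on the right or bottom border stay in the grid.
    col = max(0, min(int(cx / image_w * grid_size), grid_size - 1))
    row = max(0, min(int(cy / image_h * grid_size), grid_size - 1))
    return row, col


def relative_position_indices(boxes: List[Box], image_w: float,
                              image_h: float,
                              grid_size: int) -> List[List[int]]:
    """Return an N x N matrix of quantized relative-position indices.

    Row and column offsets each lie in [-(grid_size - 1), grid_size - 1];
    they are shifted to be non-negative and merged into one integer that
    could index a learned bias table of size (2 * grid_size - 1) ** 2.
    """
    cells = [box_to_grid_cell(b, image_w, image_h, grid_size) for b in boxes]
    span = 2 * grid_size - 1
    indices = []
    for ri, ci in cells:
        row_out = []
        for rj, cj in cells:
            d_row = ri - rj + grid_size - 1
            d_col = ci - cj + grid_size - 1
            row_out.append(d_row * span + d_col)
        indices.append(row_out)
    return indices


# Two objects in opposite corners of a 400 x 400 image on a 4 x 4 grid.
example_boxes = [(0, 0, 100, 100), (300, 300, 400, 400)]
example_indices = relative_position_indices(example_boxes, 400, 400, 4)
```

In a relative position encoding of this kind, such indices would typically select learned bias terms that are added to the self-attention scores between object features. Whether PCAT uses them in exactly this way is not stated in the abstract.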

Keywords

Image captioning; relative position encoding; object classes awareness

Cite This Article

APA Style

Tang, Z., Yi, Y., Yu, C., Yin, A. (2023). PCATNet: Position-Class Awareness Transformer for Image Captioning. Computers, Materials & Continua, 75(3), 6007–6022. https://doi.org/10.32604/cmc.2023.037861

Vancouver Style

Tang Z, Yi Y, Yu C, Yin A. PCATNet: Position-Class Awareness Transformer for Image Captioning. Comput Mater Contin. 2023;75(3):6007–6022. https://doi.org/10.32604/cmc.2023.037861

IEEE Style

Z. Tang, Y. Yi, C. Yu, and A. Yin, “PCATNet: Position-Class Awareness Transformer for Image Captioning,” Comput. Mater. Contin., vol. 75, no. 3, pp. 6007–6022, 2023. https://doi.org/10.32604/cmc.2023.037861

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
