
Open Access

ARTICLE

Enhanced Image Captioning via Integrated Wavelet Convolution and MobileNet V3 Architecture

Mo Hou1,2,3,#,*, Bin Xu4,#, Wen Shang1,2,3
1 Jiangsu Collaborative Innovation Center for Language Ability, School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, 221116, China
2 Linguistic Science Laboratory, School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, 221116, China
3 Laboratory of Philosophy and Social Sciences at Universities in Jiangsu Province, School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, 221116, China
4 School of Mathematics and Statistics, Jiangsu Normal University, Xuzhou, 221116, China
* Corresponding Author: Mo Hou
# These authors contributed equally to this work

Computers, Materials & Continua https://doi.org/10.32604/cmc.2025.071282

Received 04 August 2025; Accepted 17 September 2025; Published online 20 October 2025

Abstract

Image captioning, a pivotal research area at the intersection of image understanding, artificial intelligence, and linguistics, aims to generate natural language descriptions for images. This paper proposes an efficient image captioning model, Mob-IMWTC, which integrates an improved wavelet convolution (IMWTC) with an enhanced MobileNet V3 architecture. The model pairs the enhanced MobileNet V3 backbone with a Transformer encoder and a Transformer decoder, substantially reducing memory footprint and training time while maintaining high accuracy in generating image descriptions. IMWTC provides large receptive fields without a significant increase in parameter count or computational overhead. In the improved MobileNet V3, the classifier head is removed and the original convolutional layers are replaced with IMWTC layers, making the model well suited to low-resource devices. Experimental results on objective evaluation metrics, including BLEU, ROUGE, CIDEr, METEOR, and SPICE, demonstrate that Mob-IMWTC outperforms state-of-the-art models, including three CNN architectures (CNN-LSTM, CNN-Att-LSTM, CNN-Tran), two mainstream methods (LCM-Captioner, ClipCap), and our previous work (Mob-Tran). Subjective evaluations further validate the model's superiority in grammaticality, adequacy, logical coherence, readability, and human-likeness. Mob-IMWTC thus offers a lightweight yet effective solution for image captioning on resource-constrained devices.
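For illustration, the following minimal PyTorch sketch shows one way a wavelet-convolution layer of the kind described above can be built and substituted into a MobileNet V3 backbone with the classifier head removed. Everything here is an assumption for illustration only: the names HaarWaveletConv and replace_depthwise_convs, the single-level Haar filter bank, and the substitution rule are not taken from the paper, whose IMWTC design may use different wavelets, more decomposition levels, or a different replacement scheme.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_large

class HaarWaveletConv(nn.Module):
    """Single-level wavelet convolution (hypothetical stand-in for IMWTC):
    Haar analysis -> learnable depthwise filtering of the four sub-bands ->
    exact Haar synthesis, plus a plain depthwise branch at full resolution."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Fixed orthonormal 2x2 Haar filters: LL, LH, HL, HH.
        f = 0.5 * torch.tensor([
            [[1., 1.], [1., 1.]],      # LL: approximation
            [[1., 1.], [-1., -1.]],    # LH: horizontal detail
            [[1., -1.], [1., -1.]],    # HL: vertical detail
            [[1., -1.], [-1., 1.]],    # HH: diagonal detail
        ])
        # Tile per input channel for depthwise use: (4*channels, 1, 2, 2).
        self.register_buffer("haar", f.unsqueeze(1).repeat(channels, 1, 1, 1))
        self.channels = channels
        # Learnable depthwise conv on the half-resolution sub-bands; a small
        # kernel here covers about twice its nominal extent in the input,
        # which is how the receptive field grows without many extra parameters.
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=kernel_size // 2,
                                      groups=4 * channels, bias=False)
        self.base_conv = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        x_pad = F.pad(x, (0, w % 2, 0, h % 2))  # make H, W even for stride-2 DWT
        sub = F.conv2d(x_pad, self.haar, stride=2, groups=self.channels)
        sub = self.subband_conv(sub)
        # The Haar basis is orthonormal, so a transposed conv with the same
        # fixed filters inverts the decomposition exactly.
        rec = F.conv_transpose2d(sub, self.haar, stride=2, groups=self.channels)
        return self.base_conv(x) + rec[..., :h, :w]

def replace_depthwise_convs(module: nn.Module) -> None:
    """Swap each stride-1 depthwise Conv2d for a wavelet conv of the same
    width (an illustrative substitution rule, not the paper's exact one)."""
    for name, child in module.named_children():
        if (isinstance(child, nn.Conv2d)
                and child.groups == child.in_channels
                and child.stride == (1, 1)):
            setattr(module, name,
                    HaarWaveletConv(child.in_channels, child.kernel_size[0]))
        else:
            replace_depthwise_convs(child)

# Backbone with the classifier head dropped, as the abstract describes;
# the resulting feature map would feed a standard Transformer encoder/decoder.
backbone = mobilenet_v3_large(weights="DEFAULT").features
replace_depthwise_convs(backbone)
features = backbone(torch.randn(1, 3, 224, 224))  # -> (1, 960, 7, 7)

Note that the swapped-in layers are randomly initialized, so the backbone would need retraining; the sketch only demonstrates the shapes involved and the decompose-filter-reconstruct mechanics by which wavelet convolutions enlarge the receptive field at low parameter cost.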

Keywords

Image captioning; wavelet convolution; MobileNet V3; deep learning