Open Access
ARTICLE
Enhanced Image Captioning via Integrated Wavelet Convolution and MobileNet V3 Architecture
1 Jiangsu Collaborative Innovation Center for Language Ability, School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, 221116, China
2 Linguistic Science Laboratory, School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, 221116, China
3 Laboratory of Philosophy and Social Sciences at Universities in Jiangsu Province, School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou, 221116, China
4 School of Mathematics and Statistics, Jiangsu Normal University, Xuzhou, 221116, China
* Corresponding Author: Mo Hou. Email:
# These authors contributed equally to this work
Computers, Materials & Continua 2026, 86(2), 1-19. https://doi.org/10.32604/cmc.2025.071282
Received 04 August 2025; Accepted 17 September 2025; Issue published 09 December 2025
Abstract
Image captioning, a pivotal research area at the intersection of image understanding, artificial intelligence, and linguistics, aims to generate natural language descriptions for images. This paper proposes an efficient image captioning model named Mob-IMWTC, which integrates an improved wavelet convolution (IMWTC) with an enhanced MobileNet V3 architecture. The enhanced MobileNet V3 uses a transformer encoder as its encoding module and a transformer decoder as its decoding module. This design significantly reduces the memory footprint and training time of the model while maintaining high accuracy in the generated descriptions. IMWTC provides large receptive fields without significantly increasing the parameter count or computational overhead. In the improved MobileNet V3, the classifier is removed and IMWTC layers replace the original convolutional layers, making Mob-IMWTC exceptionally well-suited for deployment on low-resource devices. Experimental results on objective evaluation metrics, including BLEU, ROUGE, CIDEr, METEOR, and SPICE, demonstrate that Mob-IMWTC outperforms state-of-the-art models, including three CNN architectures (CNN-LSTM, CNN-Att-LSTM, CNN-Tran), two mainstream methods (LCM-Captioner, ClipCap), and our previous work (Mob-Tran). Subjective evaluations further validate the model's superiority in grammaticality, adequacy, logic, readability, and humanness. Mob-IMWTC thus offers a lightweight yet effective solution for image captioning on resource-constrained devices.
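As a rough illustration of the architecture the abstract describes, the sketch below assembles a MobileNet V3 feature extractor with its classifier removed, a Haar-wavelet convolution standing in for the paper's IMWTC layer, and a small transformer encoder-decoder captioning head. All layer sizes, the fixed Haar filter bank, and the class and parameter names are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch (assumed structure, not the paper's code) of the
# Mob-IMWTC idea: MobileNet V3 backbone without classifier, a wavelet-style
# convolution for large receptive fields with few extra parameters, and a
# transformer encoder-decoder for caption generation.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class WaveletConv2d(nn.Module):
    """Illustrative wavelet convolution: a fixed depthwise 2x2 Haar
    decomposition followed by a learnable 1x1 mix of the sub-bands
    (a stand-in for the paper's IMWTC layer)."""
    def __init__(self, channels: int):
        super().__init__()
        # Fixed Haar filters: LL, LH, HL, HH, applied per channel.
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        bank = torch.stack([ll, lh, hl, hh])               # (4, 2, 2)
        weight = bank.repeat(channels, 1, 1).unsqueeze(1)  # (4C, 1, 2, 2)
        self.register_buffer("haar", weight)
        self.channels = channels
        self.mix = nn.Conv2d(4 * channels, channels, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        # Depthwise Haar transform halves the spatial size; the 1x1 conv
        # mixes sub-bands; upsampling roughly restores the spatial size.
        y = nn.functional.conv2d(x, self.haar, stride=2, groups=self.channels)
        return self.up(self.mix(y))

class MobIMWTCSketch(nn.Module):
    def __init__(self, vocab_size: int = 10000, d_model: int = 256):
        super().__init__()
        backbone = mobilenet_v3_small(weights=None)
        self.features = backbone.features   # classifier head dropped
        self.wconv = WaveletConv2d(576)     # 576 = MobileNetV3-Small feature channels
        self.proj = nn.Linear(576, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        f = self.wconv(self.features(images))          # (B, 576, H', W')
        src = self.proj(f.flatten(2).transpose(1, 2))  # (B, H'*W', d_model)
        tgt = self.embed(captions)                     # (B, T, d_model)
        mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        return self.out(self.transformer(src, tgt, tgt_mask=mask))

model = MobIMWTCSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Replacing the backbone's standard convolutions with such wavelet layers is one plausible reading of the abstract's "IMWTC layers replace the original convolutional layers"; the sketch applies a single wavelet layer on top of the features only to keep the example short.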
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

