Tech Science Press - Publisher of Open Access Journals

Open Access

ARTICLE

Enhanced Image Captioning via Integrated Wavelet Convolution and MobileNet V3 Architecture

Mo Hou^1,2,3,#,*, Bin Xu^4,#, Wen Shang^1,2,3

CMC-Computers, Materials & Continua, Vol.86, No.2, pp. 1-19, 2026, DOI:10.32604/cmc.2025.071282 - 09 December 2025

Abstract Image captioning, a pivotal research area at the intersection of image understanding, artificial intelligence, and linguistics, aims to generate natural language descriptions for images. This paper proposes an efficient image captioning model named Mob-IMWTC, which integrates improved wavelet convolution (IMWTC) with an enhanced MobileNet V3 architecture. The enhanced MobileNet V3 integrates a transformer encoder as its encoding module and a transformer decoder as its decoding module. This innovative neural network significantly reduces the memory space required and model training time, while maintaining a high level of accuracy in generating image descriptions. IMWTC facilitates large receptive…

Open Access

ARTICLE

LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning

Yuting He, Zetao Jiang^*

CMC-Computers, Materials & Continua, Vol.84, No.3, pp. 5487-5508, 2025, DOI:10.32604/cmc.2025.065169 - 30 July 2025

Abstract Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies, which effectively extracts and leverages the global correlation of image features. However, these models still face challenges in effectively capturing local associations. Moreover, since the encoder extracts global and local association features that focus on different semantic information, semantic noise may occur during the decoding stage. To address these issues, we propose the Local Relationship Enhanced Gated Transformer (LREGT). In the encoder part, we introduce the Local Relationship Enhanced Encoder (LREE), whose core component is the Local Relationship Enhanced Module…

Open Access

ARTICLE

Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies

Komal Rani Narejo^1, Hongying Zan^1,*, Kheem Parkash Dharmani^2, Orken Mamyrbayev^3,*, Ainur Akhmediyarova^4, Zhibek Alibiyeva^4, Janna Alimkulova^5

CMC-Computers, Materials & Continua, Vol.84, No.2, pp. 3407-3429, 2025, DOI:10.32604/cmc.2025.065872 - 03 July 2025

Abstract While automatic image captioning systems have made notable progress in the past few years, generating captions that fully convey sentiment remains a considerable challenge. Although existing models achieve strong performance in visual recognition and factual description, they often fail to account for the emotional context that is naturally present in human-generated captions. To address this gap, we propose the Sentiment-Driven Caption Generator (SDCG), which combines transformer-based visual and textual processing with multi-level fusion. RoBERTa is used for extracting sentiment from textual input, while visual features are handled by the Vision Transformer (ViT). These features are…

Open Access

ARTICLE

UniTrans: Unified Parameter-Efficient Transfer Learning and Multimodal Alignment for Large Multimodal Foundation Model

Jiakang Sun^1,2, Ke Chen^1,2, Xinyang He^1,2, Xu Liu^1,2, Ke Li^1,2, Cheng Peng^1,2,*

CMC-Computers, Materials & Continua, Vol.83, No.1, pp. 219-238, 2025, DOI:10.32604/cmc.2025.059745 - 26 March 2025

Abstract With the advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal…

A Survey on Enhancing Image Captioning with Advanced Strategies and Techniques

Alaa Thobhani^1,*, Beiji Zou^1, Xiaoyan Kui^1,*, Amr Abdussalam^2, Muhammad Asim^3, Sajid Shah^3, Mohammed ELAffendi^3

CMES-Computer Modeling in Engineering & Sciences, Vol.142, No.3, pp. 2247-2280, 2025, DOI:10.32604/cmes.2025.059192 - 03 March 2025

Abstract Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful semantic sentences that describe visual content depicted in photographs and are syntactically accurate. Many real-world applications rely on image captioning, such as helping people with visual impairments to see their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are utilized to comprehend the visual content within an image, followed by natural language processing methods. Numerous approaches and models have been developed to deal with this multifaceted problem. Several models prove to be state-of-the-art solutions…

Open Access

ARTICLE

Image Captioning Using Multimodal Deep Learning Approach

Rihem Farkh^1,*, Ghislain Oudinet^1, Yasser Foued^2

CMC-Computers, Materials & Continua, Vol.81, No.3, pp. 3951-3968, 2024, DOI:10.32604/cmc.2024.053245 - 19 December 2024

Abstract The process of generating descriptive captions for images has witnessed significant advancements in recent years, owing to the progress in deep learning techniques. Despite these advancements, the task of thoroughly grasping image content and producing coherent, contextually relevant captions continues to pose a substantial challenge. In this paper, we introduce a novel multimodal method for image captioning by integrating three powerful deep learning architectures: YOLOv8 (You Only Look Once) for robust object detection, EfficientNetB7 for efficient feature extraction, and Transformers for effective sequence modeling. Our proposed model combines the strengths of YOLOv8 in detecting objects,…

Open Access

ARTICLE

A Concise and Varied Visual Features-Based Image Captioning Model with Visual Selection

Alaa Thobhani^1,*, Beiji Zou^1, Xiaoyan Kui^1, Amr Abdussalam^2, Muhammad Asim^3, Naveed Ahmed^4, Mohammed Ali Alshara^4,5

CMC-Computers, Materials & Continua, Vol.81, No.2, pp. 2873-2894, 2024, DOI:10.32604/cmc.2024.054841 - 18 November 2024

Abstract Image captioning has gained increasing attention in recent years. Visual characteristics found in input images play a crucial role in generating high-quality captions. Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image, improving the effectiveness of identifying relevant image regions at each step of caption generation. However, providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features. Consequently, this leads to enhanced captioning network performance. In light…

Open Access

ARTICLE

PCATNet: Position-Class Awareness Transformer for Image Captioning

Ziwei Tang^1, Yaohua Yi^2,*, Changhui Yu^2, Aiguo Yin^3

CMC-Computers, Materials & Continua, Vol.75, No.3, pp. 6007-6022, 2023, DOI:10.32604/cmc.2023.037861 - 29 April 2023

Abstract Existing image captioning models usually build the relation between visual information and words to generate captions, which lack spatial information and object classes. To address the issue, we propose a novel Position-Class Awareness Transformer (PCAT) network which can serve as a bridge between the visual features and captions by embedding spatial information and awareness of object classes. In our proposal, we construct our PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE includes mapping the regions of objects to grids, calculating the relative distance among…

Open Access

ARTICLE

Fine-Grained Features for Image Captioning

Mengyue Shao^1, Jie Feng^1,*, Jie Wu^1, Haixiang Zhang^1, Yayu Zheng^2

CMC-Computers, Materials & Continua, Vol.75, No.3, pp. 4697-4712, 2023, DOI:10.32604/cmc.2023.036564 - 29 April 2023

Abstract Image captioning involves two different major modalities (image and sentence) that convert a given image into a language that adheres to visual semantics. Almost all methods first extract image features to reduce the difficulty of visual semantic embedding and then use the caption model to generate fluent sentences. The Convolutional Neural Network (CNN) is often used to extract image features in image captioning, and the use of object detection networks to extract region features has achieved great success. However, the region features retrieved by this method are object-level and do not pay attention to fine-grained…

Open Access

ARTICLE

Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding

Samar Elbedwehy^1,3,*, T. Medhat^2, Taher Hamza^3, Mohammed F. Alrahmawy^3

Computer Systems Science and Engineering, Vol.46, No.3, pp. 3637-3652, 2023, DOI:10.32604/csse.2023.038376 - 03 April 2023

Abstract One of the issues in Computer Vision is the automatic development of descriptions for images, sometimes known as image captioning. Deep Learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to get among them the most expressive extracted feature vector of the image. For the…
