Search Results (18)
  • Open Access

    ARTICLE

    Enhanced Image Captioning via Integrated Wavelet Convolution and MobileNet V3 Architecture

    Mo Hou1,2,3,#,*, Bin Xu4,#, Wen Shang1,2,3

    CMC-Computers, Materials & Continua, Vol.86, No.2, pp. 1-19, 2026, DOI:10.32604/cmc.2025.071282 - 09 December 2025

    Abstract Image captioning, a pivotal research area at the intersection of image understanding, artificial intelligence, and linguistics, aims to generate natural language descriptions for images. This paper proposes an efficient image captioning model named Mob-IMWTC, which integrates improved wavelet convolution (IMWTC) with an enhanced MobileNet V3 architecture. The enhanced MobileNet V3 integrates a transformer encoder as its encoding module and a transformer decoder as its decoding module. This innovative neural network significantly reduces the memory space required and model training time, while maintaining a high level of accuracy in generating image descriptions. IMWTC facilitates large receptive… More >
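
    The Mob-IMWTC code is not reproduced in this listing, but the overall pattern the abstract describes (a MobileNet V3 feature extractor feeding a transformer encoder-decoder that emits caption tokens) can be sketched in a few lines of PyTorch. Everything below is illustrative: the class name, hyperparameters, and vocabulary size are placeholders, and the improved wavelet convolution (IMWTC) component is omitted.

    ```python
    # Illustrative captioning skeleton: MobileNet V3 grid features -> transformer
    # encoder/decoder -> token logits. Names and sizes are placeholders, and the
    # wavelet-convolution block described in the abstract is omitted.
    import torch
    import torch.nn as nn
    from torchvision.models import mobilenet_v3_small

    class LightCaptioner(nn.Module):
        def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
            super().__init__()
            self.cnn = mobilenet_v3_small(weights="DEFAULT").features  # conv backbone
            self.proj = nn.Conv2d(576, d_model, kernel_size=1)         # 576 = last channel dim
            self.embed = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                batch_first=True,
            )
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, images, captions):
            feats = self.proj(self.cnn(images))            # (B, d_model, H', W')
            memory = feats.flatten(2).transpose(1, 2)      # (B, H'*W', d_model)
            tgt = self.embed(captions)                     # (B, T, d_model)
            mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
            dec = self.transformer(memory, tgt, tgt_mask=mask)
            return self.out(dec)                           # (B, T, vocab_size)

    # logits = LightCaptioner(vocab_size=10000)(torch.randn(2, 3, 224, 224),
    #                                           torch.randint(0, 10000, (2, 12)))
    ```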

  • Open Access

    ARTICLE

    LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning

    Yuting He, Zetao Jiang*

    CMC-Computers, Materials & Continua, Vol.84, No.3, pp. 5487-5508, 2025, DOI:10.32604/cmc.2025.065169 - 30 July 2025

    Abstract Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies, which effectively extracts and leverages the global correlation of image features. However, these models still face challenges in effectively capturing local associations. Moreover, since the encoder extracts global and local association features that focus on different semantic information, semantic noise may occur during the decoding stage. To address these issues, we propose the Local Relationship Enhanced Gated Transformer (LREGT). In the encoder part, we introduce the Local Relationship Enhanced Encoder (LREE), whose core component is the Local Relationship Enhanced Module… More >
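
    The LREM module itself is not shown here; the following is only a generic sketch of the gating idea the abstract points at, blending a global self-attention branch with a neighborhood-limited local branch through a learned sigmoid gate. Module and parameter names are invented for illustration.

    ```python
    # Generic gated fusion of a global self-attention branch and a local
    # (neighborhood-limited) branch. This is NOT the paper's LREM module;
    # all names and sizes are made up for illustration.
    import torch
    import torch.nn as nn

    class GatedGlobalLocal(nn.Module):
        def __init__(self, d_model=512, nhead=8, window=3):
            super().__init__()
            self.global_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.local_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)
            self.window = window

        def local_mask(self, n, device):
            # Allow each position to attend only to neighbors within `window`.
            idx = torch.arange(n, device=device)
            return (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked

        def forward(self, x):                      # x: (B, N, d_model)
            g, _ = self.global_attn(x, x, x)
            l, _ = self.local_attn(x, x, x,
                                   attn_mask=self.local_mask(x.size(1), x.device))
            gate = torch.sigmoid(self.gate(torch.cat([g, l], dim=-1)))
            return gate * g + (1 - gate) * l       # gated blend of the two views
    ```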

  • Open Access

    ARTICLE

    Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies

    Komal Rani Narejo1, Hongying Zan1,*, Kheem Parkash Dharmani2, Orken Mamyrbayev3,*, Ainur Akhmediyarova4, Zhibek Alibiyeva4, Janna Alimkulova5

    CMC-Computers, Materials & Continua, Vol.84, No.2, pp. 3407-3429, 2025, DOI:10.32604/cmc.2025.065872 - 03 July 2025

    Abstract While automatic image captioning systems have made notable progress in the past few years, generating captions that fully convey sentiment remains a considerable challenge. Although existing models achieve strong performance in visual recognition and factual description, they often fail to account for the emotional context that is naturally present in human-generated captions. To address this gap, we propose the Sentiment-Driven Caption Generator (SDCG), which combines transformer-based visual and textual processing with multi-level fusion. RoBERTa is used for extracting sentiment from textual input, while visual features are handled by the Vision Transformer (ViT). These features are More >
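
    As a rough, hedged illustration of the kind of two-stream setup the abstract describes (ViT for visual features, RoBERTa for textual sentiment, followed by fusion), the snippet below concatenates the two [CLS]-style embeddings through a single linear layer. The checkpoints, the file path, and the fusion head are placeholders; the paper's actual multi-level fusion strategy is not reproduced.

    ```python
    # Rough sketch: ViT visual embedding + RoBERTa text embedding, joined by a
    # simple concatenation-based fusion head. Checkpoints, path, and layer sizes
    # are illustrative only, not the SDCG configuration.
    import torch
    import torch.nn as nn
    from transformers import ViTModel, RobertaModel, RobertaTokenizer, ViTImageProcessor
    from PIL import Image

    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    roberta = RobertaModel.from_pretrained("roberta-base")
    tok = RobertaTokenizer.from_pretrained("roberta-base")
    proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    fusion = nn.Linear(vit.config.hidden_size + roberta.config.hidden_size, 512)

    image = Image.open("example.jpg")              # placeholder image path
    text = "a child laughing on a swing"           # sentiment-bearing text

    with torch.no_grad():
        v = vit(**proc(images=image, return_tensors="pt")).last_hidden_state[:, 0]  # [CLS]
        t = roberta(**tok(text, return_tensors="pt")).last_hidden_state[:, 0]       # <s>
    fused = fusion(torch.cat([v, t], dim=-1))      # joint representation, (1, 512)
    ```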

  • Open Access

    ARTICLE

    UniTrans: Unified Parameter-Efficient Transfer Learning and Multimodal Alignment for Large Multimodal Foundation Model

    Jiakang Sun1,2, Ke Chen1,2, Xinyang He1,2, Xu Liu1,2, Ke Li1,2, Cheng Peng1,2,*

    CMC-Computers, Materials & Continua, Vol.83, No.1, pp. 219-238, 2025, DOI:10.32604/cmc.2025.059745 - 26 March 2025

    Abstract With the advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal More >
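
    "Vector-based" random-matrix adaptation generally means that large projection matrices are frozen at random initialization and only small per-dimension scaling vectors are trained. A minimal sketch of that family of adapters (not the exact UniTrans formulation, and without the cross-modal alignment components) is given below; all names are illustrative.

    ```python
    # Sketch of a vector-based random-matrix adapter: A and B are frozen random
    # projections, only the scaling vectors d and b are trained, and the wrapped
    # backbone layer stays frozen. Not the UniTrans implementation.
    import torch
    import torch.nn as nn

    class VectorRandomAdapter(nn.Module):
        def __init__(self, base: nn.Linear, rank=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():       # backbone weights stay frozen
                p.requires_grad_(False)
            d_in, d_out = base.in_features, base.out_features
            self.A = nn.Parameter(torch.randn(rank, d_in), requires_grad=False)   # frozen
            self.B = nn.Parameter(torch.randn(d_out, rank), requires_grad=False)  # frozen
            self.d = nn.Parameter(torch.ones(rank))     # trainable scaling vector
            self.b = nn.Parameter(torch.zeros(d_out))   # trainable scaling vector

        def forward(self, x):
            # y = base(x) + b * (B @ (d * (A @ x)))
            delta = (x @ self.A.t()) * self.d           # (..., rank)
            delta = (delta @ self.B.t()) * self.b       # (..., d_out)
            return self.base(x) + delta
    ```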

  • Open Access

    REVIEW

    A Survey on Enhancing Image Captioning with Advanced Strategies and Techniques

    Alaa Thobhani1,*, Beiji Zou1, Xiaoyan Kui1,*, Amr Abdussalam2, Muhammad Asim3, Sajid Shah3, Mohammed ELAffendi3

    CMES-Computer Modeling in Engineering & Sciences, Vol.142, No.3, pp. 2247-2280, 2025, DOI:10.32604/cmes.2025.059192 - 03 March 2025

    Abstract Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful semantic sentences that describe visual content depicted in photographs and are syntactically accurate. Many real-world applications rely on image captioning, such as helping people with visual impairments to see their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are utilized to comprehend the visual content within an image, followed by natural language processing methods. Numerous approaches and models have been developed to deal with this multifaceted problem. Several models prove to be state-of-the-art solutions… More >

  • Open Access

    ARTICLE

    Image Captioning Using Multimodal Deep Learning Approach

    Rihem Farkh1,*, Ghislain Oudinet1, Yasser Foued2

    CMC-Computers, Materials & Continua, Vol.81, No.3, pp. 3951-3968, 2024, DOI:10.32604/cmc.2024.053245 - 19 December 2024

    Abstract The process of generating descriptive captions for images has witnessed significant advancements in recent years, owing to the progress in deep learning techniques. Despite these advances, the task of thoroughly grasping image content and producing coherent, contextually relevant captions continues to pose a substantial challenge. In this paper, we introduce a novel multimodal method for image captioning by integrating three powerful deep learning architectures: YOLOv8 (You Only Look Once) for robust object detection, EfficientNetB7 for efficient feature extraction, and Transformers for effective sequence modeling. Our proposed model combines the strengths of YOLOv8 in detecting objects,… More >
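
    A pipeline in the spirit of this abstract can be outlined as: run YOLOv8 to obtain object labels, pool an EfficientNet-B7 feature vector, and pass both to a caption decoder. The sketch below stops before the decoder and uses placeholder inputs; it assumes the ultralytics package and torchvision are available and is not the authors' implementation.

    ```python
    # Illustrative front end: YOLOv8 detections + EfficientNet-B7 global features.
    # A real system would feed both into a transformer caption decoder; that part
    # and all preprocessing are omitted or stubbed here.
    import torch
    from torchvision.models import efficientnet_b7
    from torchvision.models.feature_extraction import create_feature_extractor
    from ultralytics import YOLO

    detector = YOLO("yolov8n.pt")                     # pre-trained detector
    backbone = efficientnet_b7(weights="DEFAULT").eval()
    extractor = create_feature_extractor(backbone, return_nodes={"avgpool": "feat"})

    def describe(image_path: str):
        det = detector(image_path)[0]
        labels = [det.names[int(c)] for c in det.boxes.cls]   # detected class names
        img = torch.randn(1, 3, 600, 600)             # stand-in for a preprocessed image
        with torch.no_grad():
            feat = extractor(img)["feat"].flatten(1)  # (1, 2560) global feature
        # Next step (not shown): feed `feat` and `labels` to a caption decoder.
        return labels, feat
    ```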

  • Open Access

    ARTICLE

    A Concise and Varied Visual Features-Based Image Captioning Model with Visual Selection

    Alaa Thobhani1,*, Beiji Zou1, Xiaoyan Kui1, Amr Abdussalam2, Muhammad Asim3, Naveed Ahmed4, Mohammed Ali Alshara4,5

    CMC-Computers, Materials & Continua, Vol.81, No.2, pp. 2873-2894, 2024, DOI:10.32604/cmc.2024.054841 - 18 November 2024

    Abstract Image captioning has gained increasing attention in recent years. Visual characteristics found in input images play a crucial role in generating high-quality captions. Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image, improving the effectiveness of identifying relevant image regions at each step of caption generation. However, providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features. Consequently, this leads to enhanced captioning network performance. In light… More >
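
    The selection idea can be illustrated generically: score each region feature against the decoder's current state, keep only the top-k regions, and attend over those. The helper below is a toy version of that pattern, not the selection module proposed in the paper.

    ```python
    # Toy visual feature selection: rank region features by relevance to the
    # decoder state, keep the top-k, and attend over the survivors.
    import torch

    def select_topk_regions(regions, query, k=5):
        """regions: (B, N, D) region features; query: (B, D) decoder state."""
        scores = torch.einsum("bnd,bd->bn", regions, query)     # relevance per region
        topk = scores.topk(k, dim=1).indices                    # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, regions.size(-1))
        selected = regions.gather(1, idx)                       # (B, k, D)
        weights = torch.softmax(scores.gather(1, topk), dim=1)  # attention over top-k
        return (weights.unsqueeze(-1) * selected).sum(dim=1)    # (B, D) context vector

    # context = select_topk_regions(torch.randn(2, 36, 512), torch.randn(2, 512))
    ```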

  • Open Access

    ARTICLE

    PCATNet: Position-Class Awareness Transformer for Image Captioning

    Ziwei Tang1, Yaohua Yi2,*, Changhui Yu2, Aiguo Yin3

    CMC-Computers, Materials & Continua, Vol.75, No.3, pp. 6007-6022, 2023, DOI:10.32604/cmc.2023.037861 - 29 April 2023

    Abstract Existing image captioning models usually build the relation between visual information and words to generate captions, which lack spatial information and object classes. To address the issue, we propose a novel Position-Class Awareness Transformer (PCAT) network which can serve as a bridge between the visual features and captions by embedding spatial information and awareness of object classes. In our proposal, we construct our PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE includes mapping the regions of objects to grids, calculating the relative distance among… More >
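
    The mapping step behind a grid-based position encoding can be illustrated as follows: each object's box centre is assigned to a cell of a G x G grid, and pairwise offsets between cells give relative positions. This is only the general idea; GMPE's actual encoding is more involved, and the function names below are made up.

    ```python
    # Map bounding-box centres to grid cells, then compute pairwise cell offsets.
    # A generic illustration of grid-based position encoding, not GMPE itself.
    import torch

    def boxes_to_grid(boxes, image_wh, grid=8):
        """boxes: (N, 4) as (x1, y1, x2, y2); image_wh: (W, H). Returns (N, 2) cells."""
        W, H = image_wh
        cx = (boxes[:, 0] + boxes[:, 2]) / 2 / W        # normalised centre x in [0, 1]
        cy = (boxes[:, 1] + boxes[:, 3]) / 2 / H
        gx = (cx * grid).clamp(max=grid - 1e-6).long()  # grid column index
        gy = (cy * grid).clamp(max=grid - 1e-6).long()  # grid row index
        return torch.stack([gx, gy], dim=1)

    def relative_offsets(cells):
        """Pairwise (dx, dy) offsets between grid cells, shape (N, N, 2)."""
        return cells[:, None, :] - cells[None, :, :]

    # cells = boxes_to_grid(torch.tensor([[10., 20., 110., 220.]]), (640, 480))
    ```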

  • Open Access

    ARTICLE

    Fine-Grained Features for Image Captioning

    Mengyue Shao1, Jie Feng1,*, Jie Wu1, Haixiang Zhang1, Yayu Zheng2

    CMC-Computers, Materials & Continua, Vol.75, No.3, pp. 4697-4712, 2023, DOI:10.32604/cmc.2023.036564 - 29 April 2023

    Abstract Image captioning involves two different major modalities (image and sentence) that convert a given image into a language that adheres to visual semantics. Almost all methods first extract image features to reduce the difficulty of visual semantic embedding and then use the caption model to generate fluent sentences. The Convolutional Neural Network (CNN) is often used to extract image features in image captioning, and the use of object detection networks to extract region features has achieved great success. However, the region features retrieved by this method are object-level and do not pay attention to fine-grained… More >

  • Open Access

    ARTICLE

    Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding

    Samar Elbedwehy1,3,*, T. Medhat2, Taher Hamza3, Mohammed F. Alrahmawy3

    Computer Systems Science and Engineering, Vol.46, No.3, pp. 3637-3652, 2023, DOI:10.32604/csse.2023.038376 - 03 April 2023

    Abstract One of the issues in Computer Vision is the automatic development of descriptions for images, sometimes known as image captioning. Deep Learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to get among them the most expressive extracted feature vector of the image. For the More >
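
    The feature-concatenation idea the abstract refers to amounts to pooling the outputs of two pre-trained vision backbones and joining them into one descriptor. The sketch below uses ResNet-50 and DenseNet-121 as arbitrary stand-ins; the eight backbone pairs and the word-embedding side evaluated in the paper are not reproduced.

    ```python
    # Concatenate pooled features from two pre-trained vision backbones into a
    # single image descriptor. Backbone choices here are arbitrary stand-ins.
    import torch
    from torchvision.models import resnet50, densenet121

    a = resnet50(weights="DEFAULT").eval()
    b = densenet121(weights="DEFAULT").eval()
    a.fc = torch.nn.Identity()              # expose the 2048-d pooled ResNet feature
    b.classifier = torch.nn.Identity()      # expose the 1024-d pooled DenseNet feature

    with torch.no_grad():
        x = torch.randn(1, 3, 224, 224)     # stand-in for a preprocessed image
        fused = torch.cat([a(x), b(x)], dim=-1)   # (1, 2048 + 1024) = (1, 3072)
    ```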

Displaying results 1-10 of 18 (page 1 of 2).