Search Results (4)
  • Open Access

    ARTICLE

    Performance vs. Complexity Comparative Analysis of Multimodal Bilinear Pooling Fusion Approaches for Deep Learning-Based Visual Arabic-Question Answering Systems

    Sarah M. Kamel*, Mai A. Fadel, Lamiaa Elrefaei, Shimaa I. Hassan

    CMES-Computer Modeling in Engineering & Sciences, Vol.143, No.1, pp. 373-411, 2025, DOI:10.32604/cmes.2025.062837 - 11 April 2025

    Abstract: Visual question answering (VQA) is a multimodal task involving a deep understanding of the image scene and the question’s meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the utilized multimodal bilinear…
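
    As a concrete illustration of the approach family this abstract compares, the sketch below shows a low-rank (MFB-style) bilinear pooling fusion head over pooled ResNet-152 image features and a final GRU question state. The dimensions, factor count, and normalization are illustrative assumptions, not the paper's exact configuration.

    ```python
    # Minimal sketch (PyTorch) of a low-rank bilinear pooling fusion head for
    # yes/no VQA. Assumes pooled ResNet-152 image features (2048-d) and a final
    # GRU question state (1024-d); all sizes here are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankBilinearFusion(nn.Module):
        def __init__(self, img_dim=2048, q_dim=1024, factor_dim=1000, k=5):
            super().__init__()
            # Project each modality into a shared (factor_dim * k)-dim space.
            self.img_proj = nn.Linear(img_dim, factor_dim * k)
            self.q_proj = nn.Linear(q_dim, factor_dim * k)
            self.k, self.factor_dim = k, factor_dim
            self.classifier = nn.Linear(factor_dim, 2)  # yes / no

        def forward(self, img_feat, q_feat):
            # Element-wise product of projections approximates a full bilinear map.
            joint = self.img_proj(img_feat) * self.q_proj(q_feat)
            # Sum-pool over the k factors, then signed-sqrt and L2 normalization.
            joint = joint.view(-1, self.k, self.factor_dim).sum(dim=1)
            joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)
            joint = F.normalize(joint, dim=-1)
            return self.classifier(joint)

    # img_feat: pooled ResNet-152 features; q_feat: final GRU hidden state.
    model = LowRankBilinearFusion()
    logits = model(torch.randn(4, 2048), torch.randn(4, 1024))  # -> (4, 2)
    ```

    The low-rank factorization is what keeps this tractable: a full bilinear interaction would need 2048 × 1024 weights per output dimension, which is why comparative studies in this area focus on compact approximations.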

  • Open Access

    ARTICLE

    Adjusted Reasoning Module for Deep Visual Question Answering Using Vision Transformer

    Christine Dewi, Hanna Prillysca Chernovita, Stephen Abednego Philemon, Christian Adi Ananta, Abbott Po Shun Chen*

    CMC-Computers, Materials & Continua, Vol.81, No.3, pp. 4195-4216, 2024, DOI:10.32604/cmc.2024.057453 - 19 December 2024

    Abstract: Visual Question Answering (VQA) is an interdisciplinary artificial intelligence (AI) task that integrates computer vision and natural language processing. Its purpose is to enable machines to answer questions by utilizing visual information. A VQA system typically takes an image and a natural-language query as input and produces a textual answer as output. One major obstacle in VQA is identifying a successful method to extract and merge textual and visual data. We examine “fusion” models that use information from both the text encoder and the image encoder to efficiently perform the visual question answering task. For…
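
    A minimal sketch of such a fusion model follows: a Vision Transformer image embedding is concatenated with a recurrent question encoding and classified. torchvision's vit_b_16 stands in for the paper's ViT; the GRU text encoder, dimensions, and answer-vocabulary size are assumptions, not the authors' adjusted reasoning module.

    ```python
    # Minimal fusion-model sketch (PyTorch). Assumptions: torchvision's vit_b_16
    # as image encoder, a GRU as text encoder, concatenation + MLP as the fusion
    # head. This is not the paper's exact architecture.
    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16

    class FusionVQA(nn.Module):
        def __init__(self, vocab_size=10000, text_dim=768, num_answers=1000):
            super().__init__()
            self.vit = vit_b_16(weights=None)
            self.vit.heads = nn.Identity()  # expose the 768-d class-token embedding
            self.embed = nn.Embedding(vocab_size, text_dim)
            self.text_enc = nn.GRU(text_dim, text_dim, batch_first=True)
            self.fusion = nn.Sequential(
                nn.Linear(768 + text_dim, 1024),
                nn.ReLU(),
                nn.Linear(1024, num_answers),
            )

        def forward(self, image, question_ids):
            v = self.vit(image)                           # (B, 768) visual embedding
            _, h = self.text_enc(self.embed(question_ids))
            q = h[-1]                                     # (B, text_dim) question state
            return self.fusion(torch.cat([v, q], dim=-1))

    model = FusionVQA()
    logits = model(torch.randn(2, 3, 224, 224),
                   torch.randint(0, 10000, (2, 12)))      # -> (2, 1000)
    ```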

  • Open Access

    ARTICLE

    Improving VQA via Dual-Level Feature Embedding Network

    Yaru Song*, Huahu Xu, Dikai Fang

    Intelligent Automation & Soft Computing, Vol.39, No.3, pp. 397-416, 2024, DOI:10.32604/iasc.2023.040521 - 11 July 2024

    Abstract: Visual Question Answering (VQA) has sparked widespread interest as a crucial task integrating vision and language. VQA systems primarily use attention mechanisms to associate relevant visual regions with the input question and answer it effectively. Detection-based features extracted by an object detection network capture the visual attention distribution over predetermined detection boxes and provide object-level insights, answering questions about foreground objects more effectively. However, they cannot answer questions about background regions that have no detection boxes, because they lack fine-grained detail; this is precisely the strength of grid-based features. In…
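
    One plausible way to combine the two feature levels the abstract contrasts is to let the question attend over region features and grid features separately and merge the attended summaries, as sketched below. The cross-attention design, names, and dimensions are illustrative assumptions, not the paper's dual-level embedding network.

    ```python
    # Illustrative dual-level fusion (PyTorch): a question vector attends over
    # detection-based region features and grid-based features separately; the
    # two attended summaries are then merged.
    import torch
    import torch.nn as nn

    class DualLevelFusion(nn.Module):
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.grid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.merge = nn.Linear(2 * dim, dim)

        def forward(self, question, regions, grids):
            # question: (B, 1, dim); regions: (B, Nr, dim) detected object boxes;
            # grids: (B, Ng, dim) uniform grid cells covering the whole image.
            r, _ = self.region_attn(question, regions, regions)  # foreground objects
            g, _ = self.grid_attn(question, grids, grids)        # background detail
            return self.merge(torch.cat([r, g], dim=-1)).squeeze(1)  # (B, dim)

    fusion = DualLevelFusion()
    out = fusion(torch.randn(2, 1, 512),    # question embedding
                 torch.randn(2, 36, 512),   # e.g., 36 detected regions
                 torch.randn(2, 49, 512))   # e.g., a 7x7 feature grid
    ```

    The grid branch is what covers questions about regions no detector box reaches, which is the gap the abstract identifies in purely detection-based pipelines.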

  • Open Access

    ARTICLE

    WMA: A Multi-Scale Self-Attention Feature Extraction Network Based on Weight Sharing for VQA

    Yue Li, Jin Liu*, Shengjie Shang

    Journal on Big Data, Vol.3, No.3, pp. 111-118, 2021, DOI:10.32604/jbd.2021.017169 - 22 November 2021

    Abstract: Visual Question Answering (VQA) has attracted extensive research attention and has recently become a hot topic in deep learning. The development of computer vision and natural language processing technology has contributed to the advancement of this research area. Key levers for improving the performance of a VQA system lie in the feature extraction, multimodal fusion, and answer prediction modules. An unsolved issue in popular VQA image feature extraction modules is the difficulty of extracting fine-grained features from objects of different scales. In this paper, a novel feature extraction network that combines multi-scale convolution and self-attention…
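
    In the spirit of the title (multi-scale self-attention with weight sharing), the sketch below runs parallel convolutions at several kernel sizes through one shared self-attention layer, so every scale reuses the same attention weights. The scales, channel width, and head count are assumptions, not the WMA architecture itself.

    ```python
    # Illustrative multi-scale block with weight sharing (PyTorch): each scale
    # branch is a convolution at a different kernel size, and all branches feed
    # ONE shared self-attention module.
    import torch
    import torch.nn as nn

    class MultiScaleSharedAttention(nn.Module):
        def __init__(self, channels=256, scales=(1, 3, 5), heads=4):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv2d(channels, channels, k, padding=k // 2) for k in scales
            )
            # A single attention module shared by every scale branch.
            self.shared_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

        def forward(self, x):
            outs = []
            for conv in self.convs:
                f = conv(x)                             # (B, C, H, W) at one scale
                seq = f.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
                a, _ = self.shared_attn(seq, seq, seq)  # same weights at each scale
                outs.append(a.transpose(1, 2).reshape(f.shape))
            return torch.stack(outs).sum(dim=0)         # fuse the scale branches

    block = MultiScaleSharedAttention()
    y = block(torch.randn(2, 256, 14, 14))  # -> (2, 256, 14, 14)
    ```

    Sharing one attention module across branches keeps the parameter count independent of the number of scales, which is the usual motivation for weight sharing in multi-scale designs.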
