Sarah M. Kamel1,*, Mai A. Fadel2, Lamiaa Elrefaei1,3, Shimaa I. Hassan1,4
CMES-Computer Modeling in Engineering & Sciences, Vol.143, No.1, pp. 373-411, 2025, DOI:10.32604/cmes.2025.062837
- 11 April 2025
Abstract Visual question answering (VQA) is a multimodal task that requires a deep understanding of the image scene and of the question’s meaning, together with the ability to capture the relevant correlations between the two modalities in order to infer the appropriate answer. In this paper, we propose a VQA system that answers yes/no questions, posed in Arabic, about real-world images. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the utilized multimodal bilinear…