TY  - EJOUR
AU  - Dewi, Christine
AU  - Chernovita, Hanna Prillysca
AU  - Philemon, Stephen Abednego
AU  - Ananta, Christian Adi
AU  - Chen, Abbott Po Shun
TI  - Adjusted Reasoning Module for Deep Visual Question Answering Using Vision Transformer
T2  - Computers, Materials & Continua
PY  - 2024
VL  - 81
IS  - 3
SN  - 1546-2226
AB  - Visual Question Answering (VQA) is an interdisciplinary artificial intelligence (AI) task that integrates computer vision and natural language processing. Its purpose is to enable machines to answer questions by utilizing visual information. A VQA system typically takes an image and a natural language question as input and produces a textual answer as output. One major obstacle in VQA is finding an effective method to extract and merge textual and visual data. We examine “Fusion” models that combine information from both the text encoder and the image encoder to perform the visual question-answering task efficiently. For the text encoder, we utilize the transformer models BERT and RoBERTa, which process the textual data. The image encoder, which processes the image data, utilizes ViT (Vision Transformer), DeiT (Data-efficient Image Transformer), and BEiT (Bidirectional Encoder representation from Image Transformers). We update the reasoning module of the VQA model and incorporate layer normalization to enhance performance. Compared with the results of previous research, our proposed method yields a substantial improvement. Our experiments obtained 60.4% accuracy on the PathVQA dataset and 69.2% accuracy on the VizWiz dataset.
KW  - VQA
KW  - vision transformer
KW  - multimodal data
KW  - deep learning
DO  - 10.32604/cmc.2024.057453
ER  - 